Building Transactional Frequency Models
In this workbook we investigate different ways to model the transaction frequency of individual customers, with a view to expanding this approach into a more traditional Pareto/NBD (P/NBD) model.
1 Load and Configure Datasets
We first want to load some synthetic transaction data.
customer_cohort_tbl <- read_rds("data/synthdata_singleyear_cohort_tbl.rds")
customer_cohort_tbl %>% glimpse()
## Rows: 50,000
## Columns: 4
## $ customer_id <chr> "T13BDJUZAC", "C2TB0LP0C6", "JIKQDKZ26T", "138K9L955K",…
## $ cohort_qtr <chr> "2019 Q1", "2019 Q1", "2019 Q1", "2019 Q1", "2019 Q1", …
## $ cohort_ym <chr> "2019 01", "2019 01", "2019 01", "2019 01", "2019 01", …
## $ first_tnx_date <date> 2019-01-01, 2019-01-01, 2019-01-01, 2019-01-01, 2019-0…
customer_simparams_tbl <- read_rds("data/synthdata_singleyear_simparams_tbl.rds")
customer_simparams_tbl %>% glimpse()
## Rows: 50,000
## Columns: 8
## $ customer_id <chr> "0010G0VY0J", "00127H7D5N", "0019QVYV9S", "001YCS74VX"…
## $ cohort_qtr <chr> "2019 Q1", "2019 Q1", "2019 Q1", "2019 Q1", "2019 Q1",…
## $ cohort_ym <chr> "2019 01", "2019 01", "2019 01", "2019 01", "2019 01",…
## $ first_tnx_date <date> 2019-01-01, 2019-01-01, 2019-01-01, 2019-01-01, 2019-…
## $ customer_tau <dbl> 53, 53, 53, 53, 53, 53, 53, 53, 53, 53, 53, 53, 53, 53…
## $ customer_lambda <dbl> 0.220770473, 0.142113120, 0.243301028, 0.066357589, 0.…
## $ customer_nu <dbl> 1.37056019, 0.40887331, 0.09656006, 0.03381901, 0.0651…
## $ customer_p <dbl> 100, 100, 100, 100, 100, 100, 100, 100, 100, 100, 100,…
customer_transactions_tbl <- read_rds("data/synthdata_singleyear_transactions_tbl.rds")
customer_transactions_tbl %>% glimpse()
## Rows: 698,567
## Columns: 5
## $ customer_id <chr> "0010G0VY0J", "0010G0VY0J", "0010G0VY0J", "0010G0VY0J", …
## $ cohort_qtr <chr> "2019 Q1", "2019 Q1", "2019 Q1", "2019 Q1", "2019 Q1", "…
## $ cohort_ym <chr> "2019 01", "2019 01", "2019 01", "2019 01", "2019 01", "…
## $ tnx_timestamp <dttm> 2019-01-01 23:31:04, 2019-01-05 10:22:11, 2019-01-09 22…
## $ tnx_amount <dbl> 76.52, 85.76, 72.59, 69.78, 73.71, 83.74, 228.39, 222.85…
The transaction data is the main input for this work, so we show the first few rows.
customer_transactions_tbl %>% arrange(tnx_timestamp) %>% head(10)
## # A tibble: 10 × 5
## customer_id cohort_qtr cohort_ym tnx_timestamp tnx_amount
## <chr> <chr> <chr> <dttm> <dbl>
## 1 JQP7OEHGAC 2019 Q1 2019 01 2019-01-01 00:00:00 163.
## 2 0K1WX20VB8 2019 Q1 2019 01 2019-01-01 00:00:00 659.
## 3 9S35M13ZXY 2019 Q1 2019 01 2019-01-01 00:00:00 210.
## 4 NJ97964ECA 2019 Q1 2019 01 2019-01-01 00:00:01 35.0
## 5 OMRT6PCB5F 2019 Q1 2019 01 2019-01-01 00:00:03 12228.
## 6 GZ79WBXAT6 2019 Q1 2019 01 2019-01-01 00:00:03 657.
## 7 JXAPDQ0KZ2 2019 Q1 2019 01 2019-01-01 00:00:04 174.
## 8 6QHCMENBNQ 2019 Q1 2019 01 2019-01-01 00:00:06 135.
## 9 R4OG6SE6LM 2019 Q1 2019 01 2019-01-01 00:00:06 847.
## 10 A053QLITIO 2019 Q1 2019 01 2019-01-01 00:00:10 126.
We also want to set up a number of parameters for use in this workbook.
stan_modeldir <- "stan_models"
stan_codedir <- "stan_code"
1.1 Construct Frequency Data
customer_summarystats_tbl <- customer_transactions_tbl %>%
calculate_transaction_summary_stats()
customer_summarystats_tbl %>% glimpse()
## Rows: 50,000
## Columns: 9
## $ customer_id <chr> "0010G0VY0J", "00127H7D5N", "0019QVYV9S", "001YCS74VX", "…
## $ tnx_count <int> 6, 6, 14, 6, 50, 77, 25, 4, 11, 4, 26, 9, 8, 5, 1, 3, 7, …
## $ first_tnx_ts <dttm> 2019-01-01 23:31:04, 2019-01-01 14:58:53, 2019-01-01 08:…
## $ last_tnx_ts <dttm> 2019-09-23 00:10:59, 2019-11-04 00:44:13, 2019-12-25 23:…
## $ btyd_count <dbl> 5, 5, 13, 5, 49, 76, 24, 3, 10, 3, 25, 8, 7, 4, 0, 2, 6, …
## $ all_weeks <dbl> 52, 52, 52, 52, 52, 52, 52, 52, 52, 52, 52, 52, 52, 52, 5…
## $ tnx_weeks <dbl> 37.71825, 43.77235, 51.23203, 13.76003, 50.62822, 50.3100…
## $ obs_freq <dbl> 0.13256184, 0.11422735, 0.25374752, 0.36337141, 0.9678396…
## $ emp_freq <dbl> 0.09615385, 0.09615385, 0.25000000, 0.09615385, 0.9423076…
We want to visualise the transactions to get a sense of how regular they are in general.
plot_tbl <- customer_transactions_tbl %>%
group_nest(customer_id, .key = "cust_data") %>%
filter(map_int(cust_data, nrow) > 3) %>%
slice_sample(n = 30) %>%
unnest(cust_data)
ggplot(plot_tbl, aes(x = tnx_timestamp, y = customer_id)) +
geom_line() +
geom_point() +
labs(
x = "Date",
y = "Customer ID",
title = "Visualisation of Customer Transaction Times"
) +
theme(axis.text.y = element_text(size = 10))
1.2 Construct Data Subsets
To save time and computation, we first construct subsets of the data.
Rather than drawing a different random sample each time, we construct fixed
sets of customer_id values to draw upon, then use those lists whenever we
want to take a subset.
We do this by randomly shuffling the data and then selecting the top \(n\) values
of customer_id.
shuffle_tbl <- customer_summarystats_tbl %>%
slice_sample(n = nrow(.), replace = FALSE)
id_1000 <- shuffle_tbl %>% head(1000) %>% pull(customer_id) %>% sort()
id_5000 <- shuffle_tbl %>% head(5000) %>% pull(customer_id) %>% sort()
id_10000 <- shuffle_tbl %>% head(10000) %>% pull(customer_id) %>% sort()
We now have lists of customer_id values that we use to subset the data.
2 Use of Conjugate Priors
Some combinations of likelihood model and prior distribution combine in a convenient way.
In particular, if this combination integrates to a posterior distribution that has the same distribution family as the prior, the prior is said to be a conjugate prior for that likelihood.
The major benefit of this situation is that the integration step can be skipped entirely: the parameters of the conjugate prior are simply combined with summary statistics of the observed data to obtain our posterior distribution.
Most commonly-used likelihood models have conjugate priors, and the Poisson distribution has the Gamma as a conjugate prior.
In particular, suppose we have a Gamma distribution prior on \(\lambda\):
\[ \lambda_{prior} \sim \Gamma(r, \alpha) \] and we observe \(x\) events over a period \(n\); our posterior distribution for \(\lambda\) is then
\[ \lambda_{post} \sim \Gamma(r + x, \alpha + n) \]
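As a quick numerical check of this update rule (a sketch using hypothetical values, not part of the workbook's helper functions):

```r
# Gamma-Poisson conjugate update: prior Gamma(r, alpha), x events over n weeks
gamma_poisson_update <- function(r, alpha, x, n) {
  list(shape = r + x, rate = alpha + n)
}

# Hypothetical customer: 13 transactions observed over 52 weeks, Gamma(1, 4) prior
post <- gamma_poisson_update(r = 1, alpha = 4, x = 13, n = 52)
post_mean <- post$shape / post$rate  # (1 + 13) / (4 + 52) = 0.25
```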
2.1 Statistical Inference on Frequency Rates
Suppose a new customer joins our website and we observe a number of transactions from that customer over the following year. For now, we ignore churn and assume the customer was active for that whole year.
How do we combine our prior knowledge and the observed transactions to perform inference on that customer’s transaction rate?
From the above section, we see that calculating the posterior distribution for \(\lambda\) is simple, so we now investigate the consequences of this by looking at how different sets of observations and priors combine.
2.2 Initial Examples
We set our prior based on average expectations of the population of interest. Suppose the average customer transacts on our site once a month, or once every four weeks.
Since we are using weekly time periods as our reference point, this means that if we were calculating a point estimate it would be 0.25.
For the purposes of Bayesian analysis though, this is insufficient - we need a distribution for \(\lambda\) that captures our prior knowledge.
Narrowing down to a Gamma distribution, the mean for our Gamma is
\[ \bar{x} = \frac{r}{\alpha}. \]
So now we look at a few different parameter combinations with this mean.
x_vals <- seq(0.01, 1, by = 0.001)
y1_vals <- dgamma(x_vals, shape = 0.1, rate = 0.4)
y2_vals <- dgamma(x_vals, shape = 1.0, rate = 4.0)
y3_vals <- dgamma(x_vals, shape = 2.0, rate = 8.0)
y4_vals <- dgamma(x_vals, shape = 10.0, rate = 40.0)
plot_tbl <- tibble(
x = x_vals,
`Gamma(0.1, 0.4)` = y1_vals,
`Gamma(1, 4)` = y2_vals,
`Gamma(2, 8)` = y3_vals,
`Gamma(10, 40)` = y4_vals
) %>%
pivot_longer(
!x,
names_to = "label",
values_to = "dens"
)
ggplot(plot_tbl) +
geom_line(aes(x = x, y = dens, colour = label)) +
labs(
x = "Value",
y = "Density",
title = "Comparison Density Plot for Gamma"
)
So we now look at the effect of priors when combined with observed data.
Suppose a customer transacts 13 times over the course of a year - this is fully consistent with the prior mean (13/52 = 0.25).
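Before plotting, we can compute the posterior summaries in closed form (a sketch; the interval is illustrative of the qgamma call rather than quoted from the workbook):

```r
# Closed-form posterior under the Gamma(1, 4) prior: 13 events in 52 weeks
post_shape <- 1 + 13
post_rate  <- 4 + 52

post_mean     <- post_shape / post_rate                        # 0.25
post_interval <- qgamma(c(0.05, 0.95), post_shape, post_rate)  # central 90% interval
```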
x_vals <- seq(0.01, 1, by = 0.001)
prior_1_4_dens <- dgamma(x_vals, shape = 1, rate = 4)
post_1_4_dens <- dgamma(x_vals, shape = 1 + 13, rate = 4 + 52)
ggplot() +
geom_line(aes(x = x_vals, y = prior_1_4_dens), colour = "black") +
geom_line(aes(x = x_vals, y = post_1_4_dens), colour = "red") +
labs(
x = "Value",
y = "Density",
title = "Prior and Posterior Comparison with Gamma(1, 4) Prior"
)
We now construct a similar comparison with a stronger prior on \(\lambda\).
x_vals <- seq(0.01, 1, by = 0.001)
prior_5_20_dens <- dgamma(x_vals, shape = 5, rate = 20)
post_5_20_dens <- dgamma(x_vals, shape = 5 + 13, rate = 20 + 52)
ggplot() +
geom_line(aes(x = x_vals, y = prior_5_20_dens), colour = "black") +
geom_line(aes(x = x_vals, y = post_5_20_dens), colour = "red") +
labs(
x = "Value",
y = "Density",
title = "Prior and Posterior Comparison with Gamma(5, 20) Prior"
)
It might be more useful to compare the two posteriors directly - both are based on the same observed data.
ggplot() +
geom_line(aes(x = x_vals, y = post_1_4_dens), colour = "black") +
geom_line(aes(x = x_vals, y = post_5_20_dens), colour = "red") +
labs(
x = "Value",
y = "Density",
title = "Comparison of Posterior Densities Based on 13 Observed Events"
)
2.3 Prior and Data Mismatch
It is worth exploring how the prior and data interact when they disagree.
We use a few Gamma priors with a mean value of 0.5 and the same data (13 events in 52 weeks) and see what the effect is.
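Before plotting, the pull of each prior can be summarised by its posterior mean, \((r + x) / (\alpha + n)\) (a sketch with the values used below):

```r
# Posterior means under each Gamma prior after observing 13 events in 52 weeks
priors <- list(c(1, 2), c(5, 10), c(50, 100))
post_means <- sapply(priors, function(p) (p[1] + 13) / (p[2] + 52))
# roughly 0.259, 0.290 and 0.414: the stronger the prior, the further the
# posterior mean is pulled from the data estimate of 0.25 towards 0.5
```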
x_vals <- seq(0.01, 1, by = 0.001)
prior_1_2_dens <- dgamma(x_vals, shape = 1, rate = 2)
post_1_2_dens <- dgamma(x_vals, shape = 1 + 13, rate = 2 + 52)
ggplot() +
geom_line(aes(x = x_vals, y = prior_1_2_dens), colour = "black") +
geom_line(aes(x = x_vals, y = post_1_2_dens), colour = "red") +
labs(
x = "Value",
y = "Density",
title = "Prior and Posterior Comparison with Gamma(1, 2) Prior"
)
We try a tighter prior around the same value (though the weighting still favours the data).
x_vals <- seq(0.01, 1, by = 0.001)
prior_5_10_dens <- dgamma(x_vals, shape = 5, rate = 10)
post_5_10_dens <- dgamma(x_vals, shape = 5 + 13, rate = 10 + 52)
ggplot() +
geom_line(aes(x = x_vals, y = prior_5_10_dens), colour = "black") +
geom_line(aes(x = x_vals, y = post_5_10_dens), colour = "red") +
labs(
x = "Value",
y = "Density",
title = "Prior and Posterior Comparison with Gamma(5, 10) Prior"
)
We also want to try this with a prior that is “stronger” than the data.
x_vals <- seq(0.01, 1, by = 0.001)
prior_50_100_dens <- dgamma(x_vals, shape = 50, rate = 100)
post_50_100_dens <- dgamma(x_vals, shape = 50 + 13, rate = 100 + 52)
ggplot() +
geom_line(aes(x = x_vals, y = prior_50_100_dens), colour = "black") +
geom_line(aes(x = x_vals, y = post_50_100_dens), colour = "red") +
labs(
x = "Value",
y = "Density",
title = "Prior and Posterior Comparison with Gamma(50, 100) Prior"
)
We finally want to check what happens with a very weak prior in the wrong location.
x_vals <- seq(0.01, 1, by = 0.001)
prior_1_1_dens <- dgamma(x_vals, shape = 1, rate = 1)
post_1_1_dens <- dgamma(x_vals, shape = 1 + 13, rate = 1 + 52)
ggplot() +
geom_line(aes(x = x_vals, y = prior_1_1_dens), colour = "black") +
geom_line(aes(x = x_vals, y = post_1_1_dens), colour = "red") +
labs(
x = "Value",
y = "Density",
title = "Prior and Posterior Comparison with Gamma(1, 1) Prior"
)
3 Construct Flat Frequency Model
For our first frequency model we fit each customer’s transaction frequency independently of the other customers. We start from the assumption that the average customer frequency is about one transaction every four weeks (once a month), and this value determines our initial prior distribution for the transaction frequency.
For our initial fit, we use only 1,000 eligible customers to reduce computation time, and we construct the dataset used to fit these models.
fit_1000_data_tbl <- customer_summarystats_tbl %>%
filter(customer_id %in% id_1000)
fit_1000_data_tbl %>% glimpse()
## Rows: 1,000
## Columns: 9
## $ customer_id <chr> "00JP8DATDZ", "020O4TLBUO", "02AIZVL9MO", "03V7QNX3MY", "…
## $ tnx_count <int> 2, 3, 3, 28, 11, 12, 27, 17, 4, 5, 26, 7, 1, 5, 3, 13, 19…
## $ first_tnx_ts <dttm> 2019-01-01 14:03:20, 2019-01-01 04:33:37, 2019-01-01 23:…
## $ last_tnx_ts <dttm> 2019-04-08 22:11:33, 2019-01-24 20:30:27, 2019-07-25 20:…
## $ btyd_count <dbl> 1, 2, 2, 27, 10, 11, 26, 16, 3, 4, 25, 6, 0, 4, 2, 12, 18…
## $ all_weeks <dbl> 52, 52, 52, 52, 52, 52, 52, 52, 52, 52, 52, 52, 52, 52, 5…
## $ tnx_weeks <dbl> 13.905576, 3.380637, 29.269596, 50.074107, 50.648265, 47.…
## $ obs_freq <dbl> 0.07191360, 0.59160452, 0.06833029, 0.53920082, 0.1974401…
## $ emp_freq <dbl> 0.01923077, 0.03846154, 0.03846154, 0.51923077, 0.1923076…
We do the same for the 10k customers.
fit_10000_data_tbl <- customer_summarystats_tbl %>%
filter(customer_id %in% id_10000)
fit_10000_data_tbl %>% glimpse()
## Rows: 10,000
## Columns: 9
## $ customer_id <chr> "0010G0VY0J", "001YCS74VX", "003L6XXKCJ", "00G07U1V2X", "…
## $ tnx_count <int> 6, 6, 77, 1, 8, 2, 48, 27, 3, 6, 17, 27, 30, 13, 1, 7, 16…
## $ first_tnx_ts <dttm> 2019-01-01 23:31:04, 2019-01-01 23:00:08, 2019-01-01 02:…
## $ last_tnx_ts <dttm> 2019-09-23 00:10:59, 2019-04-08 06:41:11, 2019-12-19 06:…
## $ btyd_count <dbl> 5, 5, 76, 0, 7, 1, 47, 26, 2, 5, 16, 26, 29, 12, 0, 6, 15…
## $ all_weeks <dbl> 52, 52, 52, 52, 52, 52, 52, 52, 52, 52, 52, 52, 52, 52, 5…
## $ tnx_weeks <dbl> 37.718247, 13.760026, 50.310053, 52.000000, 38.361259, 13…
## $ obs_freq <dbl> 0.13256184, 0.36337141, 1.51063247, 0.00000000, 0.1824757…
## $ emp_freq <dbl> 0.09615385, 0.09615385, 1.46153846, 0.00000000, 0.1346153…
3.1 Fit Stan Model
We start by fitting our first Stan model, which estimates a Poisson rate for each individual customer.
## data {
## int<lower=1> n; // count of observations
##
## array[n] int<lower=0> btyd_count; // observed number of transactions
## vector<lower=0>[n] tnx_weeks; // observed time-period of transactions
## }
##
## parameters {
## vector<lower=0>[n] lambda;
## }
##
## model {
## target += poisson_lpmf(btyd_count | lambda .* tnx_weeks);
## }
##
## generated quantities {
## vector[n] log_lik;
##
## for (i in 1:n) {
## log_lik[i] = poisson_lpmf(btyd_count[i] | lambda[i] * tnx_weeks[i]);
## }
## }
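Note that this model places no explicit prior on lambda, so each customer's posterior is proportional to the Poisson likelihood alone - a Gamma(x + 1, t) density. This gives us a closed-form check on the sampler output (a sketch with hypothetical values):

```r
# Under the flat prior, a customer with x transactions over t weeks has a
# Gamma(x + 1, t) posterior: mode x / t (the MLE) and mean (x + 1) / t.
x <- 5   # hypothetical btyd_count
t <- 52  # observation period in weeks

post_mode <- x / t        # maximum-likelihood estimate
post_mean <- (x + 1) / t  # posterior mean under the flat prior
```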
We now compile this model using CmdStanR.
freqmodel_flat_stanmodel <- cmdstan_model(
"stan_code/freqmodel_flat.stan",
include_paths = stan_codedir,
pedantic = TRUE,
dir = stan_modeldir
)
We then use this compiled model with our data to produce a fit.
stan_modelname <- "freqmodel_flat"
stanfit_prefix <- str_c("fit_", stan_modelname)
stan_data_lst <- fit_1000_data_tbl %>%
select(customer_id, btyd_count, tnx_weeks = all_weeks) %>%
compose_data()
freqmodel_flat_stanfit <- freqmodel_flat_stanmodel$sample(
data = stan_data_lst,
chains = 4,
iter_warmup = 500,
iter_sampling = 500,
seed = 4201,
save_warmup = TRUE,
output_dir = stan_modeldir,
output_basename = stanfit_prefix
)
## Running MCMC with 4 parallel chains...
##
## Chain 1 Iteration: 1 / 1000 [ 0%] (Warmup)
## Chain 2 Iteration: 1 / 1000 [ 0%] (Warmup)
## Chain 3 Iteration: 1 / 1000 [ 0%] (Warmup)
## Chain 4 Iteration: 1 / 1000 [ 0%] (Warmup)
## Chain 1 Iteration: 100 / 1000 [ 10%] (Warmup)
## Chain 2 Iteration: 100 / 1000 [ 10%] (Warmup)
## Chain 3 Iteration: 100 / 1000 [ 10%] (Warmup)
## Chain 4 Iteration: 100 / 1000 [ 10%] (Warmup)
## Chain 1 Iteration: 200 / 1000 [ 20%] (Warmup)
## Chain 2 Iteration: 200 / 1000 [ 20%] (Warmup)
## Chain 3 Iteration: 200 / 1000 [ 20%] (Warmup)
## Chain 4 Iteration: 200 / 1000 [ 20%] (Warmup)
## Chain 1 Iteration: 300 / 1000 [ 30%] (Warmup)
## Chain 2 Iteration: 300 / 1000 [ 30%] (Warmup)
## Chain 3 Iteration: 300 / 1000 [ 30%] (Warmup)
## Chain 4 Iteration: 300 / 1000 [ 30%] (Warmup)
## Chain 1 Iteration: 400 / 1000 [ 40%] (Warmup)
## Chain 2 Iteration: 400 / 1000 [ 40%] (Warmup)
## Chain 3 Iteration: 400 / 1000 [ 40%] (Warmup)
## Chain 4 Iteration: 400 / 1000 [ 40%] (Warmup)
## Chain 1 Iteration: 500 / 1000 [ 50%] (Warmup)
## Chain 1 Iteration: 501 / 1000 [ 50%] (Sampling)
## Chain 2 Iteration: 500 / 1000 [ 50%] (Warmup)
## Chain 2 Iteration: 501 / 1000 [ 50%] (Sampling)
## Chain 3 Iteration: 500 / 1000 [ 50%] (Warmup)
## Chain 3 Iteration: 501 / 1000 [ 50%] (Sampling)
## Chain 4 Iteration: 500 / 1000 [ 50%] (Warmup)
## Chain 4 Iteration: 501 / 1000 [ 50%] (Sampling)
## Chain 1 Iteration: 600 / 1000 [ 60%] (Sampling)
## Chain 2 Iteration: 600 / 1000 [ 60%] (Sampling)
## Chain 3 Iteration: 600 / 1000 [ 60%] (Sampling)
## Chain 4 Iteration: 600 / 1000 [ 60%] (Sampling)
## Chain 1 Iteration: 700 / 1000 [ 70%] (Sampling)
## Chain 3 Iteration: 700 / 1000 [ 70%] (Sampling)
## Chain 4 Iteration: 700 / 1000 [ 70%] (Sampling)
## Chain 1 Iteration: 800 / 1000 [ 80%] (Sampling)
## Chain 2 Iteration: 700 / 1000 [ 70%] (Sampling)
## Chain 3 Iteration: 800 / 1000 [ 80%] (Sampling)
## Chain 4 Iteration: 800 / 1000 [ 80%] (Sampling)
## Chain 1 Iteration: 900 / 1000 [ 90%] (Sampling)
## Chain 4 Iteration: 900 / 1000 [ 90%] (Sampling)
## Chain 2 Iteration: 800 / 1000 [ 80%] (Sampling)
## Chain 3 Iteration: 900 / 1000 [ 90%] (Sampling)
## Chain 4 Iteration: 1000 / 1000 [100%] (Sampling)
## Chain 1 Iteration: 1000 / 1000 [100%] (Sampling)
## Chain 3 Iteration: 1000 / 1000 [100%] (Sampling)
## Chain 2 Iteration: 900 / 1000 [ 90%] (Sampling)
## Chain 1 finished in 6.7 seconds.
## Chain 3 finished in 6.5 seconds.
## Chain 4 finished in 6.3 seconds.
## Chain 2 Iteration: 1000 / 1000 [100%] (Sampling)
## Chain 2 finished in 7.1 seconds.
##
## All 4 chains finished successfully.
## Mean chain execution time: 6.6 seconds.
## Total execution time: 7.3 seconds.
freqmodel_flat_stanfit$summary()
## # A tibble: 2,001 × 10
## variable mean median sd mad q5 q95 rhat ess_bulk
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 lp__ -4.30e+3 -4.30e+3 23.2 23.1 -4.34e+3 -4.27e+3 1.00 623.
## 2 lambda[1] 3.86e-2 3.19e-2 0.0269 0.0230 7.19e-3 9.02e-2 1.00 2056.
## 3 lambda[2] 5.74e-2 5.07e-2 0.0338 0.0308 1.65e-2 1.24e-1 1.00 2031.
## 4 lambda[3] 5.88e-2 5.33e-2 0.0332 0.0318 1.74e-2 1.22e-1 1.00 2694.
## 5 lambda[4] 5.37e-1 5.32e-1 0.101 0.103 3.82e-1 7.15e-1 1.00 2518.
## 6 lambda[5] 2.11e-1 2.04e-1 0.0640 0.0628 1.18e-1 3.30e-1 0.999 2357.
## 7 lambda[6] 2.33e-1 2.27e-1 0.0682 0.0662 1.29e-1 3.55e-1 1.00 2764.
## 8 lambda[7] 5.19e-1 5.11e-1 0.105 0.107 3.66e-1 7.03e-1 1.00 2570.
## 9 lambda[8] 3.26e-1 3.19e-1 0.0810 0.0792 2.05e-1 4.65e-1 1.00 2433.
## 10 lambda[9] 7.73e-2 7.11e-2 0.0384 0.0356 2.65e-2 1.49e-1 1.00 2389.
## # … with 1,991 more rows, and 1 more variable: ess_tail <dbl>
We first want to check the HMC diagnostics.
freqmodel_flat_stanfit$cmdstan_diagnose()
## Processing csv files: /home/rstudio/workshop/stan_models/fit_freqmodel_flat-1.csvWarning: non-fatal error reading adaptation data
## , /home/rstudio/workshop/stan_models/fit_freqmodel_flat-2.csvWarning: non-fatal error reading adaptation data
## , /home/rstudio/workshop/stan_models/fit_freqmodel_flat-3.csvWarning: non-fatal error reading adaptation data
## , /home/rstudio/workshop/stan_models/fit_freqmodel_flat-4.csvWarning: non-fatal error reading adaptation data
##
##
## Checking sampler transitions treedepth.
## Treedepth satisfactory for all transitions.
##
## Checking sampler transitions for divergences.
## No divergent transitions found.
##
## Checking E-BFMI - sampler transitions HMC potential energy.
## E-BFMI satisfactory.
##
## Effective sample size satisfactory.
##
## Split R-hat values satisfactory all parameters.
##
## Processing complete, no problems detected.
3.1.1 Visual Diagnostics of the Sample Validity
Now that we have a sample from the posterior distribution we need to create a few different visualisations of the diagnostics.
parameter_subset <- c(
"lambda[1]", "lambda[2]", "lambda[3]", "lambda[4]", "lambda[5]", "lambda[6]"
)
freqmodel_flat_stanfit$draws(inc_warmup = TRUE) %>%
mcmc_trace(
pars = parameter_subset,
n_warmup = 500
) +
ggtitle("Full Traceplots of Some Lambda Values")
As the warmup is skewing the y-axis somewhat, we repeat this process without the warmup.
parameter_subset <- c(
"lambda[1]", "lambda[2]", "lambda[3]", "lambda[4]", "lambda[5]", "lambda[6]"
)
freqmodel_flat_stanfit$draws(inc_warmup = FALSE) %>%
mcmc_trace(
pars = parameter_subset
) +
expand_limits(y = 0) +
labs(
x = "Iteration",
y = "Value",
title = "Traceplot of Sample of Lambda Values"
) +
theme(axis.text.x = element_text(size = 10))
A common MCMC diagnostic is \(\hat{R}\), a measure of the ‘similarity’ of the chains.
freqmodel_flat_stanfit %>%
rhat(pars = "lambda") %>%
mcmc_rhat() +
ggtitle("Plot of Parameter R-hat Values")
Related to this quantity is the concept of effective sample size, \(N_{eff}\), an estimate of the size of the sample from a statistical information point of view.
freqmodel_flat_stanfit %>%
neff_ratio(pars = "lambda") %>%
mcmc_neff() +
ggtitle("Plot of Parameter Effective Sample Sizes")
Finally, we also want to look at autocorrelation in the chains for each parameter.
freqmodel_flat_stanfit$draws() %>%
mcmc_acf(pars = parameter_subset) +
ggtitle("Autocorrelation Plot of Sample Lambda Values")
In practice we do not always require a comprehensive exploration of the diagnostics, but it is good practice to run through the various visualisations when we fit a model to ensure our sample is valid.
3.2 Check Model Fit
We now need to check the parameters of this fit against the data to see how effective our model is at capturing the data. In this case we have the benefit of knowing the ‘true’ data, and so we compare our model output against the input parameters.
freqmodel_flat_validation_tbl <- freqmodel_flat_stanfit %>%
recover_types(fit_1000_data_tbl) %>%
spread_draws(lambda[customer_id]) %>%
ungroup() %>%
inner_join(customer_simparams_tbl, by = "customer_id") %>%
select(customer_id, draw_id = .draw, post_lambda = lambda, customer_lambda)
freqmodel_flat_validation_tbl %>% glimpse()
## Rows: 2,000,000
## Columns: 4
## $ customer_id <chr> "00JP8DATDZ", "00JP8DATDZ", "00JP8DATDZ", "00JP8DATDZ"…
## $ draw_id <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16,…
## $ post_lambda <dbl> 0.03152330, 0.01923300, 0.06058230, 0.03693450, 0.0235…
## $ customer_lambda <dbl> 0.01822481, 0.01822481, 0.01822481, 0.01822481, 0.0182…
Having constructed the validation data we now want to check the quantile of each ‘true’ value within the posterior distribution for its parameter. If our model is valid, these quantiles will be uniformly distributed on \([0, 1]\).
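The logic of this check can be sketched with the conjugate model alone (simulated values, independent of calculate_distribution_qvals):

```r
set.seed(42)

# Simulated calibration check: draw 'true' rates from a Gamma(1, 4) prior,
# simulate 52 weeks of counts, and compute each true value's quantile
# within its conjugate posterior.
n_cust <- 2000
true_lambda <- rgamma(n_cust, shape = 1, rate = 4)
obs_count   <- rpois(n_cust, true_lambda * 52)

q_val <- vapply(seq_len(n_cust), function(i) {
  post_draws <- rgamma(1000, shape = 1 + obs_count[i], rate = 4 + 52)
  mean(post_draws < true_lambda[i])
}, numeric(1))

# A well-calibrated model gives a flat q_val histogram on [0, 1]
```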
plotvalid_tbl <- freqmodel_flat_validation_tbl %>%
calculate_distribution_qvals(post_lambda, customer_lambda)
ggplot(plotvalid_tbl) +
geom_histogram(aes(x = q_val), bins = 50) +
geom_hline(aes(yintercept = plotvalid_tbl %>% nrow() %>% divide_by(50)), colour = "red") +
labs(
x = "Quantile",
y = "Count",
title = "Quantile Plot of the q-Values for the Posterior Distribution"
)
ggplot(plotvalid_tbl) +
geom_point(aes(x = q_val, y = customer_lambda)) +
labs(
x = "Quantile",
y = "Customer Lambda",
title = "Scatterplot of q-Value against Lambda"
)
3.3 Fit Larger Model
We now repeat this model with a larger dataset.
stan_modelname <- "freqmodel_flat_10k"
stanfit_prefix <- str_c("fit_", stan_modelname)
stan_data_lst <- fit_10000_data_tbl %>%
select(customer_id, btyd_count, tnx_weeks = all_weeks) %>%
compose_data()
freqmodel_flat_10k_stanfit <- freqmodel_flat_stanmodel$sample(
data = stan_data_lst,
chains = 4,
iter_warmup = 500,
iter_sampling = 500,
seed = 4202,
output_dir = stan_modeldir,
output_basename = stanfit_prefix
)
## Running MCMC with 4 parallel chains...
##
## Chain 1 Iteration: 1 / 1000 [ 0%] (Warmup)
## Chain 2 Iteration: 1 / 1000 [ 0%] (Warmup)
## Chain 3 Iteration: 1 / 1000 [ 0%] (Warmup)
## Chain 4 Iteration: 1 / 1000 [ 0%] (Warmup)
## Chain 3 Iteration: 100 / 1000 [ 10%] (Warmup)
## Chain 1 Iteration: 100 / 1000 [ 10%] (Warmup)
## Chain 4 Iteration: 100 / 1000 [ 10%] (Warmup)
## Chain 2 Iteration: 100 / 1000 [ 10%] (Warmup)
## Chain 1 Iteration: 200 / 1000 [ 20%] (Warmup)
## Chain 3 Iteration: 200 / 1000 [ 20%] (Warmup)
## Chain 4 Iteration: 200 / 1000 [ 20%] (Warmup)
## Chain 2 Iteration: 200 / 1000 [ 20%] (Warmup)
## Chain 1 Iteration: 300 / 1000 [ 30%] (Warmup)
## Chain 4 Iteration: 300 / 1000 [ 30%] (Warmup)
## Chain 3 Iteration: 300 / 1000 [ 30%] (Warmup)
## Chain 2 Iteration: 300 / 1000 [ 30%] (Warmup)
## Chain 1 Iteration: 400 / 1000 [ 40%] (Warmup)
## Chain 4 Iteration: 400 / 1000 [ 40%] (Warmup)
## Chain 3 Iteration: 400 / 1000 [ 40%] (Warmup)
## Chain 2 Iteration: 400 / 1000 [ 40%] (Warmup)
## Chain 1 Iteration: 500 / 1000 [ 50%] (Warmup)
## Chain 1 Iteration: 501 / 1000 [ 50%] (Sampling)
## Chain 4 Iteration: 500 / 1000 [ 50%] (Warmup)
## Chain 4 Iteration: 501 / 1000 [ 50%] (Sampling)
## Chain 3 Iteration: 500 / 1000 [ 50%] (Warmup)
## Chain 3 Iteration: 501 / 1000 [ 50%] (Sampling)
## Chain 2 Iteration: 500 / 1000 [ 50%] (Warmup)
## Chain 2 Iteration: 501 / 1000 [ 50%] (Sampling)
## Chain 1 Iteration: 600 / 1000 [ 60%] (Sampling)
## Chain 4 Iteration: 600 / 1000 [ 60%] (Sampling)
## Chain 3 Iteration: 600 / 1000 [ 60%] (Sampling)
## Chain 2 Iteration: 600 / 1000 [ 60%] (Sampling)
## Chain 1 Iteration: 700 / 1000 [ 70%] (Sampling)
## Chain 4 Iteration: 700 / 1000 [ 70%] (Sampling)
## Chain 3 Iteration: 700 / 1000 [ 70%] (Sampling)
## Chain 2 Iteration: 700 / 1000 [ 70%] (Sampling)
## Chain 4 Iteration: 800 / 1000 [ 80%] (Sampling)
## Chain 1 Iteration: 800 / 1000 [ 80%] (Sampling)
## Chain 3 Iteration: 800 / 1000 [ 80%] (Sampling)
## Chain 2 Iteration: 800 / 1000 [ 80%] (Sampling)
## Chain 4 Iteration: 900 / 1000 [ 90%] (Sampling)
## Chain 1 Iteration: 900 / 1000 [ 90%] (Sampling)
## Chain 3 Iteration: 900 / 1000 [ 90%] (Sampling)
## Chain 2 Iteration: 900 / 1000 [ 90%] (Sampling)
## Chain 4 Iteration: 1000 / 1000 [100%] (Sampling)
## Chain 4 finished in 89.3 seconds.
## Chain 1 Iteration: 1000 / 1000 [100%] (Sampling)
## Chain 1 finished in 89.9 seconds.
## Chain 3 Iteration: 1000 / 1000 [100%] (Sampling)
## Chain 3 finished in 90.3 seconds.
## Chain 2 Iteration: 1000 / 1000 [100%] (Sampling)
## Chain 2 finished in 91.0 seconds.
##
## All 4 chains finished successfully.
## Mean chain execution time: 90.1 seconds.
## Total execution time: 91.3 seconds.
freqmodel_flat_10k_stanfit$summary()
## # A tibble: 20,001 × 10
## variable mean median sd mad q5 q95 rhat ess_bulk
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 lp__ -4.31e+4 -4.31e+4 70.9 70.1 -4.32e+4 -4.30e+4 1.00 569.
## 2 lambda[1] 1.16e-1 1.11e-1 0.0464 0.0433 5.00e-2 2.01e-1 1.00 2728.
## 3 lambda[2] 1.14e-1 1.08e-1 0.0455 0.0448 5.17e-2 1.99e-1 1.00 3480.
## 4 lambda[3] 1.48e+0 1.47e+0 0.163 0.158 1.22e+0 1.77e+0 1.01 3725.
## 5 lambda[4] 1.91e-2 1.31e-2 0.0189 0.0137 9.50e-4 5.62e-2 1.00 2402.
## 6 lambda[5] 1.52e-1 1.46e-1 0.0549 0.0541 7.52e-2 2.55e-1 1.00 2418.
## 7 lambda[6] 3.84e-2 3.16e-2 0.0275 0.0227 7.49e-3 9.02e-2 1.00 3123.
## 8 lambda[7] 9.20e-1 9.12e-1 0.135 0.136 7.10e-1 1.16e+0 1.00 2913.
## 9 lambda[8] 5.20e-1 5.13e-1 0.102 0.102 3.64e-1 7.00e-1 1.00 4974.
## 10 lambda[9] 5.67e-2 4.95e-2 0.0330 0.0301 1.56e-2 1.19e-1 1.00 2072.
## # … with 19,991 more rows, and 1 more variable: ess_tail <dbl>
We first want to check the HMC diagnostics.
freqmodel_flat_10k_stanfit$cmdstan_diagnose()
## Processing csv files: /home/rstudio/workshop/stan_models/fit_freqmodel_flat_10k-1.csv, /home/rstudio/workshop/stan_models/fit_freqmodel_flat_10k-2.csv, /home/rstudio/workshop/stan_models/fit_freqmodel_flat_10k-3.csv, /home/rstudio/workshop/stan_models/fit_freqmodel_flat_10k-4.csv
##
## Checking sampler transitions treedepth.
## Treedepth satisfactory for all transitions.
##
## Checking sampler transitions for divergences.
## No divergent transitions found.
##
## Checking E-BFMI - sampler transitions HMC potential energy.
## E-BFMI satisfactory.
##
## Effective sample size satisfactory.
##
## Split R-hat values satisfactory all parameters.
##
## Processing complete, no problems detected.
3.4 Check Larger Model Fit
We now repeat the validation exercise once more, extracting the posterior customer \(\lambda\) values and comparing them to our input values.
freqmodel_flat_10k_validation_tbl <- freqmodel_flat_10k_stanfit %>%
recover_types(fit_10000_data_tbl) %>%
spread_draws(lambda[customer_id]) %>%
ungroup() %>%
inner_join(customer_simparams_tbl, by = "customer_id") %>%
select(customer_id, draw_id = .draw, post_lambda = lambda, customer_lambda)
freqmodel_flat_10k_validation_tbl %>% glimpse()
## Rows: 20,000,000
## Columns: 4
## $ customer_id <chr> "0010G0VY0J", "0010G0VY0J", "0010G0VY0J", "0010G0VY0J"…
## $ draw_id <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16,…
## $ post_lambda <dbl> 0.1198230, 0.1016800, 0.1271920, 0.1051370, 0.0854725,…
## $ customer_lambda <dbl> 0.2207705, 0.2207705, 0.2207705, 0.2207705, 0.2207705,…
We also construct validation plots for this.
plotvalid_tbl <- freqmodel_flat_10k_validation_tbl %>%
calculate_distribution_qvals(post_lambda, customer_lambda)
ggplot(plotvalid_tbl) +
geom_histogram(aes(x = q_val), bins = 50) +
geom_hline(aes(yintercept = plotvalid_tbl %>% nrow() %>% divide_by(50)), colour = "red") +
labs(
x = "Quantile",
y = "Count",
title = "Quantile Plot of the q-Values for the Posterior Distribution"
)
ggplot(plotvalid_tbl) +
geom_point(aes(x = q_val, y = customer_lambda), alpha = 0.2) +
labs(
x = "Quantile",
y = "Customer Lambda",
title = "Scatterplot of q-Value against Lambda"
)
3.5 Construct Observed-Time Frequency Model
We now want to update our approach to use the observed duration of the transactions rather than using a constant time period of 52 weeks.
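To see why this matters, compare the rate estimates for a hypothetical customer whose last transaction falls well before the end of the year (illustrative numbers only):

```r
# 13 repeat transactions, with the last one observed at week 40
btyd_count <- 13

rate_full_period <- btyd_count / 52  # 0.25, using the fixed one-year window
rate_observed    <- btyd_count / 40  # 0.325, using the observed span only
```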
stan_modelname <- "freqmodel_flat_obs"
stanfit_prefix <- str_c("fit_", stan_modelname)
stan_data_lst <- fit_1000_data_tbl %>%
transmute(customer_id, btyd_count, tnx_weeks) %>%
compose_data()
freqmodel_flat_obs_stanfit <- freqmodel_flat_stanmodel$sample(
data = stan_data_lst,
chains = 4,
iter_warmup = 500,
iter_sampling = 500,
seed = 4203,
output_dir = stan_modeldir,
output_basename = stanfit_prefix
)
## Running MCMC with 4 parallel chains...
##
## Chain 1 Iteration: 1 / 1000 [ 0%] (Warmup)
## Chain 2 Iteration: 1 / 1000 [ 0%] (Warmup)
## Chain 3 Iteration: 1 / 1000 [ 0%] (Warmup)
## Chain 4 Iteration: 1 / 1000 [ 0%] (Warmup)
## ...
## Chain 1 finished in 5.4 seconds.
## Chain 2 finished in 5.4 seconds.
## Chain 3 finished in 5.1 seconds.
## Chain 4 finished in 5.3 seconds.
##
## All 4 chains finished successfully.
## Mean chain execution time: 5.3 seconds.
## Total execution time: 5.9 seconds.
freqmodel_flat_obs_stanfit$summary()## # A tibble: 2,001 × 10
## variable mean median sd mad q5 q95 rhat ess_bulk
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 lp__ -4.13e+3 -4.13e+3 22.7 22.4 -4.16e+3 -4.09e+3 1.00 945.
## 2 lambda[1] 1.45e-1 1.23e-1 0.106 0.0862 2.22e-2 3.55e-1 1.00 1764.
## 3 lambda[2] 9.05e-1 8.11e-1 0.508 0.458 2.66e-1 1.93e+0 1.00 2136.
## 4 lambda[3] 1.02e-1 8.99e-2 0.0594 0.0532 2.61e-2 2.17e-1 1.01 2212.
## 5 lambda[4] 5.59e-1 5.51e-1 0.107 0.106 3.98e-1 7.44e-1 1.00 2229.
## 6 lambda[5] 2.19e-1 2.12e-1 0.0652 0.0626 1.23e-1 3.39e-1 1.00 2137.
## 7 lambda[6] 2.55e-1 2.49e-1 0.0724 0.0717 1.48e-1 3.81e-1 1.00 2068.
## 8 lambda[7] 5.18e-1 5.11e-1 0.0997 0.0954 3.71e-1 6.98e-1 1.00 2024.
## 9 lambda[8] 3.29e-1 3.22e-1 0.0786 0.0770 2.12e-1 4.67e-1 1.00 2194.
## 10 lambda[9] 7.93e-2 7.36e-2 0.0389 0.0364 2.80e-2 1.52e-1 1.00 2410.
## # … with 1,991 more rows, and 1 more variable: ess_tail <dbl>
Now that we have fit the model, we repeat the validation exercise from before. We expect this model to deviate from the first one, however, as we now have a less reliable estimate of the observation time for each customer: the observation window is taken to end at their last recorded transaction.
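To make this concrete, here is a small base-R sketch (with invented customer IDs and dates, not taken from the dataset) of how such an observed time could be derived, with each customer's window ending at their last recorded transaction:

```r
# Toy transaction log standing in for customer_transactions_tbl.
tnx <- data.frame(
  customer_id = c("A", "A", "A", "B", "B"),
  tnx_date    = as.Date(c("2019-01-01", "2019-02-01", "2019-03-15",
                          "2019-01-10", "2019-01-24"))
)

# Observed time per customer: first to last transaction, in weeks.
obs_weeks <- tapply(
  tnx$tnx_date, tnx$customer_id,
  function(d) as.numeric(difftime(max(d), min(d), units = "weeks"))
)
obs_weeks
```

Because the window stops at the last purchase, any quiet period afterwards is ignored, which is what makes this estimate of the observation time less reliable.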
freqmodel_flat_obs_validation_tbl <- freqmodel_flat_obs_stanfit %>%
recover_types(fit_1000_data_tbl) %>%
spread_draws(lambda[customer_id]) %>%
ungroup() %>%
inner_join(customer_simparams_tbl, by = "customer_id") %>%
select(customer_id, draw_id = .draw, post_lambda = lambda, customer_lambda)
freqmodel_flat_obs_validation_tbl %>% glimpse()## Rows: 2,000,000
## Columns: 4
## $ customer_id <chr> "00JP8DATDZ", "00JP8DATDZ", "00JP8DATDZ", "00JP8DATDZ"…
## $ draw_id <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16,…
## $ post_lambda <dbl> 0.2040730, 0.0686289, 0.1629100, 0.3354840, 0.0417703,…
## $ customer_lambda <dbl> 0.01822481, 0.01822481, 0.01822481, 0.01822481, 0.0182…
We also construct validation plots for this.
plotvalid_tbl <- freqmodel_flat_obs_validation_tbl %>%
calculate_distribution_qvals(post_lambda, customer_lambda)
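The calculate_distribution_qvals() helper is defined earlier in the workbook; as a rough sketch of the idea (an assumption about its logic, not the actual implementation), the q-value is the empirical CDF of the posterior draws evaluated at the known true value:

```r
# Sketch of a q-value calculation: the proportion of posterior draws at or
# below the true parameter value. For a well-calibrated posterior these
# q-values are approximately uniform on [0, 1] across parameters.
calc_qval_sketch <- function(post_draws, true_value) {
  mean(post_draws <= true_value)
}

# If the true value sits at the centre of the posterior, q_val is near 0.5.
set.seed(42)
q_val <- calc_qval_sketch(rnorm(2000, mean = 0.2, sd = 0.05), true_value = 0.2)
```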
ggplot(plotvalid_tbl) +
geom_histogram(aes(x = q_val), bins = 50) +
geom_hline(aes(yintercept = plotvalid_tbl %>% nrow() %>% divide_by(50)), colour = "red") +
labs(
x = "Quantile",
y = "Count",
title = "Quantile Plot of the q-Values for the Posterior Distribution"
)ggplot(plotvalid_tbl) +
geom_point(aes(x = q_val, y = customer_lambda), alpha = 0.2) +
labs(
x = "Quantile",
y = "Customer Lambda",
title = "Scatterplot of q-Value against Lambda"
)
3.6 Construct Posterior Distribution Comparisons
We now combine the posterior distributions output by the two models and compare them.
comparison_distrib_tbl <- list(
`Fixed Time` = freqmodel_flat_validation_tbl,
`Observed Time` = freqmodel_flat_obs_validation_tbl
) %>%
bind_rows(.id = "label")
comparison_distrib_tbl %>% glimpse()## Rows: 4,000,000
## Columns: 5
## $ label <chr> "Fixed Time", "Fixed Time", "Fixed Time", "Fixed Time"…
## $ customer_id <chr> "00JP8DATDZ", "00JP8DATDZ", "00JP8DATDZ", "00JP8DATDZ"…
## $ draw_id <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16,…
## $ post_lambda <dbl> 0.03152330, 0.01923300, 0.06058230, 0.03693450, 0.0235…
## $ customer_lambda <dbl> 0.01822481, 0.01822481, 0.01822481, 0.01822481, 0.0182…
Now that we have this data we can construct a number of visualisations to help us compare these two distributions on a per-customer basis.
comparison_summary_tbl <- comparison_distrib_tbl %>%
group_by(label, customer_id) %>%
summarise(
.groups = "drop",
labels = c("q10", "q25", "q50", "q75", "q90", "mean", "customer_lambda"),
vals = c(quantile(post_lambda, probs = c(0.1, 0.25, 0.50, 0.75, 0.90)),
mean(post_lambda),
unique(customer_lambda))
) %>%
pivot_wider(
names_from = labels,
values_from = vals
)
plot_tbl <- comparison_summary_tbl %>%
group_nest(customer_id) %>%
slice_sample(n = 30) %>%
unnest(data)
ggplot(plot_tbl) +
geom_errorbar(
aes(x = customer_id, ymin = q10, ymax = q90, colour = label),
position = position_dodge(width = 0.75), width = 0, size = 1) +
geom_errorbar(
aes(x = customer_id, ymin = q25, ymax = q75, colour = label),
position = position_dodge(width = 0.75), width = 0, size = 2) +
geom_point(aes(x = customer_id, y = customer_lambda)) +
labs(
x = "Customer ID",
y = "Frequency Rate",
colour = "Distribution"
) +
theme(axis.text.x = element_text(angle = 90, vjust = 0.5))
As there is a broad selection of different lambda values, we draw a random selection of customer_id for different ranges of these values.
plotdata_tbl <- comparison_summary_tbl %>%
mutate(
lambda_segment = case_when(
(customer_lambda > 0.00 & customer_lambda <= 0.30) ~ "SEG1",
(customer_lambda > 0.30 & customer_lambda <= 0.60) ~ "SEG2",
(customer_lambda > 0.60 & customer_lambda <= 0.90) ~ "SEG3",
(customer_lambda > 0.90 & customer_lambda <= 1.20) ~ "SEG4",
(customer_lambda > 1.20) ~ "SEG5",
TRUE ~ "ERROR"
)
) %>%
group_nest(lambda_segment) %>%
mutate(
sample_data = map(data, ~ .x %>% group_nest(customer_id) %>% slice_sample(n = 10))
) %>%
select(lambda_segment, sample_data) %>%
unnest(sample_data) %>%
unnest(data)
ggplot(plotdata_tbl %>% mutate(customer_id = str_trunc(customer_id, width = 7))) +
geom_errorbar(
aes(x = customer_id, ymin = q10, ymax = q90, colour = label),
position = position_dodge(width = 0.75), width = 0, size = 1) +
geom_errorbar(
aes(x = customer_id, ymin = q25, ymax = q75, colour = label),
position = position_dodge(width = 0.75), width = 0, size = 2) +
geom_point(
aes(x = customer_id, y = customer_lambda, group = label),
position = position_dodge(width = 0.75)) +
facet_wrap(vars(lambda_segment), scales = "free") +
labs(
x = "Customer ID",
y = "Frequency Rate",
colour = "Distribution"
) +
theme(
axis.text.x = element_text(angle = 90, vjust = 0.5, size = 6),
legend.position = "bottom"
)
3.7 Estimate Posterior Interval Coverage
One final validity check is to see how often the ‘true’ lambda value falls
within our posterior intervals. We create boolean indicators for both intervals
(the 50% and the 80%) and then compare the observed coverage proportions
against the nominal widths.
coverage_stats_tbl <- comparison_summary_tbl %>%
mutate(
up_50 = customer_lambda > q75,  # true value above the 50% interval
dn_50 = customer_lambda < q25,  # true value below the 50% interval
up_80 = customer_lambda > q90,  # true value above the 80% interval
dn_80 = customer_lambda < q10,  # true value below the 80% interval
cover_50 = (!up_50 & !dn_50),
cover_80 = (!up_80 & !dn_80)
)
coverage_stats_tbl %>%
group_by(label) %>%
summarise(
.groups = "drop",
prop_50 = sum(cover_50) / n(),
prop_80 = sum(cover_80) / n(),
prop_up50 = sum(up_50) / n(),
prop_dn50 = sum(dn_50) / n(),
prop_up80 = sum(up_80) / n(),
prop_dn80 = sum(dn_80) / n()
)## # A tibble: 2 × 7
## label prop_50 prop_80 prop_up50 prop_dn50 prop_up80 prop_dn80
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 Fixed Time 0.5 0.79 0.164 0.336 0.053 0.157
## 2 Observed Time 0.424 0.711 0.107 0.469 0.032 0.257
These numbers broadly line up with what we expect: the fixed-time coverage is
close to nominal, but using the observed time as our input introduces an
upward bias in the inferred values for lambda, with the true value falling
below the 80% interval for roughly a quarter of customers. We could assess
this further using bootstrap techniques, but I do not think it is necessary.
We will leave repeating the above analysis for the larger datasets as an exercise.
4 Adding Priors to the Model
We now want to add priors to our model to improve our inferences. Our current
model provides no way to encode our prior knowledge of the values of lambda -
yet a little thought shows we do have such knowledge. For example, we know
that a value above 10.0 is almost impossible, and most customers will have a
lambda rate between 0 and 1, averaging around roughly 0.25.
A good choice for this prior is a Gamma distribution with shape parameter 1 and rate parameter 4, which has mean 0.25.
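We can sanity-check this choice analytically before simulating: the mean of a Gamma(shape, rate) is shape / rate, and the mass above 10 should be negligible.

```r
shape <- 1
rate  <- 4

prior_mean    <- shape / rate                                # 0.25
prior_q95     <- qgamma(0.95, shape, rate)                   # roughly 0.75
prob_above_10 <- pgamma(10, shape, rate, lower.tail = FALSE) # effectively zero
```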
We can check what this looks like by simulation:
lambda_vals <- rgamma(10000, shape = 1, rate = 4)
ggplot() +
geom_histogram(aes(x = lambda_vals), bins = 50) +
labs(
x = "Lambda",
y = "Frequency",
title = "Histogram of Lambda Values Simulated from Gamma(1, 4)"
)
The above distribution looks reasonable, though perhaps we would want a bit more probability mass at the higher end.
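A quick way to compare candidate priors is their upper-tail mass. For example, a Gamma(0.25, 1), which also has mean 0.25, puts noticeably more mass above 1 than the Gamma(1, 4) does:

```r
# P(lambda > 1) under each candidate prior; both priors have mean 0.25,
# but the lower-shape Gamma carries the fatter upper tail.
tail_gamma_1_4 <- pgamma(1, shape = 1,    rate = 4, lower.tail = FALSE)
tail_gamma_q_1 <- pgamma(1, shape = 0.25, rate = 1, lower.tail = FALSE)
```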
lambda_vals <- rgamma(10000, shape = 0.25, rate = 1.0)
ggplot() +
geom_histogram(aes(x = lambda_vals), bins = 50) +
labs(
x = "Lambda",
y = "Frequency",
title = "Histogram of Lambda Values Simulated from Gamma(0.25, 1)"
)
This distribution looks a lot more skewed, so we also plot this histogram on a log-scale.
ggplot() +
geom_histogram(aes(x = lambda_vals), bins = 50) +
scale_x_log10() +
labs(
x = "Lambda",
y = "Frequency",
title = "Histogram of Lambda Values Simulated from Gamma(0.25, 1)"
)
4.1 Construct Stan Model with Priors on Lambda
This is ‘prior’ knowledge, and we encode it by placing these priors on the parameters within our model.
## data {
## int<lower=1> n; // count of observations
##
## array[n] int<lower=0> btyd_count; // observed number of transactions
## vector<lower=0>[n] tnx_weeks; // observed time-period of transactions
##
## real r;
## real alpha;
## }
##
## parameters {
## vector<lower=0>[n] lambda;
## }
##
## model {
## lambda ~ gamma(r, alpha);
##
## target += poisson_lpmf(btyd_count | lambda .* tnx_weeks);
## }
##
## generated quantities {
## vector[n] log_lik;
##
## for (i in 1:n) {
## log_lik[i] = poisson_lpmf(btyd_count[i] | lambda[i] * tnx_weeks[i]);
## }
## }
You can see we have now added code to say that our posterior inference on
each customer value for lambda is the combination of a prior Gamma
distribution and our Poisson likelihood for the observed counts.
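As a side note, the Gamma prior is conjugate to the Poisson likelihood, so the per-customer posterior has a closed form we can use to spot-check the sampler: with prior Gamma(r, α) and x transactions over t weeks, the posterior is Gamma(r + x, α + t). A sketch with illustrative numbers (not taken from the fitted model):

```r
r     <- 0.25  # prior shape
alpha <- 1     # prior rate

x       <- 12  # hypothetical customer: 12 transactions ...
t_weeks <- 40  # ... observed over 40 weeks

# Conjugate update: lambda | x ~ Gamma(r + x, alpha + t)
post_shape <- r + x
post_rate  <- alpha + t_weeks
post_mean  <- post_shape / post_rate
```

The posterior mean (about 0.299) sits between the MLE x / t = 0.3 and the prior mean r / α = 0.25, illustrating the shrinkage the prior induces.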
We now compile this model using CmdStanR.
freqmodel_prior_stanmodel <- cmdstan_model(
"stan_code/freqmodel_prior.stan",
include_paths = stan_codedir,
pedantic = TRUE,
dir = stan_modeldir
)
Having compiled the Stan model, we now fit it with our data.
stan_modelname <- "freqmodel_prior"
stanfit_prefix <- str_c("fit_", stan_modelname)
stan_data_lst <- fit_1000_data_tbl %>%
transmute(customer_id, btyd_count, tnx_weeks) %>%
compose_data(
r = 0.25,
alpha = 1
)
freqmodel_prior_stanfit <- freqmodel_prior_stanmodel$sample(
data = stan_data_lst,
chains = 4,
iter_warmup = 500,
iter_sampling = 500,
seed = 4204,
output_dir = stan_modeldir,
output_basename = stanfit_prefix
)## Running MCMC with 4 parallel chains...
##
## Chain 1 Iteration: 1 / 1000 [ 0%] (Warmup)
## Chain 2 Iteration: 1 / 1000 [ 0%] (Warmup)
## Chain 3 Iteration: 1 / 1000 [ 0%] (Warmup)
## Chain 4 Iteration: 1 / 1000 [ 0%] (Warmup)
## ...
## Chain 4 finished in 11.8 seconds.
## Chain 1 finished in 12.5 seconds.
## Chain 2 finished in 12.8 seconds.
## Chain 3 finished in 13.1 seconds.
##
## All 4 chains finished successfully.
## Mean chain execution time: 12.6 seconds.
## Total execution time: 13.5 seconds.
freqmodel_prior_stanfit$summary()## # A tibble: 2,001 × 10
## variable mean median sd mad q5 q95 rhat ess_bulk
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 lp__ -3.18e+3 -3.18e+3 23.3 24.1 -3.22e+3 -3.14e+3 1.00 889.
## 2 lambda[1] 8.16e-2 6.23e-2 0.0718 0.0553 7.50e-3 2.21e-1 1.00 1899.
## 3 lambda[2] 5.17e-1 4.40e-1 0.359 0.304 1.00e-1 1.24e+0 1.00 2214.
## 4 lambda[3] 7.55e-2 6.30e-2 0.0518 0.0433 1.62e-2 1.72e-1 1.00 2642.
## 5 lambda[4] 5.33e-1 5.26e-1 0.104 0.108 3.77e-1 7.12e-1 1.00 2162.
## 6 lambda[5] 1.97e-1 1.89e-1 0.0613 0.0580 1.10e-1 3.11e-1 1.01 2670.
## 7 lambda[6] 2.34e-1 2.25e-1 0.0693 0.0696 1.35e-1 3.56e-1 1.00 1919.
## 8 lambda[7] 4.98e-1 4.93e-1 0.0986 0.0960 3.50e-1 6.71e-1 1.00 2552.
## 9 lambda[8] 3.05e-1 3.00e-1 0.0755 0.0753 1.91e-1 4.34e-1 1.00 1812.
## 10 lambda[9] 6.29e-2 5.66e-2 0.0352 0.0326 1.75e-2 1.29e-1 1.00 2022.
## # … with 1,991 more rows, and 1 more variable: ess_tail <dbl>
Once again, we want to check the HMC diagnostics.
freqmodel_prior_stanfit$cmdstan_diagnose()## Processing csv files: /home/rstudio/workshop/stan_models/fit_freqmodel_prior-1.csv, /home/rstudio/workshop/stan_models/fit_freqmodel_prior-2.csv, /home/rstudio/workshop/stan_models/fit_freqmodel_prior-3.csv, /home/rstudio/workshop/stan_models/fit_freqmodel_prior-4.csv
##
## Checking sampler transitions treedepth.
## Treedepth satisfactory for all transitions.
##
## Checking sampler transitions for divergences.
## No divergent transitions found.
##
## Checking E-BFMI - sampler transitions HMC potential energy.
## E-BFMI satisfactory.
##
## Effective sample size satisfactory.
##
## Split R-hat values satisfactory all parameters.
##
## Processing complete, no problems detected.
4.2 Check Posterior Lambda Values
We repeat our validation exercise from before, comparing the outputs of the
posterior distribution for each lambda against the known true value.
freqmodel_prior_validation_tbl <- freqmodel_prior_stanfit %>%
recover_types(fit_1000_data_tbl) %>%
spread_draws(lambda[customer_id]) %>%
ungroup() %>%
inner_join(customer_simparams_tbl, by = "customer_id") %>%
select(customer_id, draw_id = .draw, post_lambda = lambda, customer_lambda)
freqmodel_prior_validation_tbl %>% glimpse()## Rows: 2,000,000
## Columns: 4
## $ customer_id <chr> "00JP8DATDZ", "00JP8DATDZ", "00JP8DATDZ", "00JP8DATDZ"…
## $ draw_id <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16,…
## $ post_lambda <dbl> 0.12767900, 0.22879800, 0.01763790, 0.22609800, 0.0195…
## $ customer_lambda <dbl> 0.01822481, 0.01822481, 0.01822481, 0.01822481, 0.0182…
We now calculate the q-values of each parameter against the posterior distribution and plot a histogram of these q-values.
plotvalid_tbl <- freqmodel_prior_validation_tbl %>%
calculate_distribution_qvals(post_lambda, customer_lambda)
ggplot(plotvalid_tbl) +
geom_histogram(aes(x = q_val), bins = 50) +
geom_hline(aes(yintercept = plotvalid_tbl %>% nrow() %>% divide_by(50)), colour = "red") +
labs(
x = "Quantile",
y = "Count",
title = "Quantile Plot of the q-Values for the Posterior Distribution"
)
4.3 Comparing the Uniform Prior and Gamma Prior Posteriors
Now that we have both posterior distributions based on the observed lifetime of each customer, it is useful to compare the two.
comparison_distrib_tbl <- list(
`Flat` = freqmodel_flat_obs_validation_tbl,
`Prior` = freqmodel_prior_validation_tbl
) %>%
bind_rows(.id = "label")
comparison_distrib_tbl %>% glimpse()## Rows: 4,000,000
## Columns: 5
## $ label <chr> "Flat", "Flat", "Flat", "Flat", "Flat", "Flat", "Flat"…
## $ customer_id <chr> "00JP8DATDZ", "00JP8DATDZ", "00JP8DATDZ", "00JP8DATDZ"…
## $ draw_id <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16,…
## $ post_lambda <dbl> 0.2040730, 0.0686289, 0.1629100, 0.3354840, 0.0417703,…
## $ customer_lambda <dbl> 0.01822481, 0.01822481, 0.01822481, 0.01822481, 0.0182…
We can now compare the outputs of the two posterior distributions: one using a uniform prior and one using a Gamma prior.
comparison_summary_tbl <- comparison_distrib_tbl %>%
group_by(label, customer_id) %>%
summarise(
.groups = "drop",
labels = c("q10", "q25", "q50", "q75", "q90", "mean", "customer_lambda"),
vals = c(quantile(post_lambda, probs = c(0.1, 0.25, 0.50, 0.75, 0.90)),
mean(post_lambda),
unique(customer_lambda))
) %>%
pivot_wider(
names_from = labels,
values_from = vals
)
comparison_summary_tbl %>% glimpse()## Rows: 2,000
## Columns: 9
## $ label <chr> "Flat", "Flat", "Flat", "Flat", "Flat", "Flat", "Flat"…
## $ customer_id <chr> "00JP8DATDZ", "020O4TLBUO", "02AIZVL9MO", "03V7QNX3MY"…
## $ q10 <dbl> 0.036296910, 0.337456100, 0.035923530, 0.428432700, 0.…
## $ q25 <dbl> 0.069737625, 0.529555250, 0.057754575, 0.485222750, 0.…
## $ q50 <dbl> 0.12261000, 0.81107350, 0.08986235, 0.55071450, 0.2123…
## $ q75 <dbl> 0.18882625, 1.16598250, 0.13248400, 0.62804075, 0.2571…
## $ q90 <dbl> 0.27889410, 1.59615500, 0.18247880, 0.69566790, 0.3052…
## $ mean <dbl> 0.14498034, 0.90473102, 0.10181685, 0.55933882, 0.2186…
## $ customer_lambda <dbl> 0.018224807, 0.026671151, 0.018550612, 0.649189558, 0.…
Now that we have summary statistics for the two distributions, we segment our
data into groups by the known values of lambda and then plot the range of
values for each distribution.
plotdata_tbl <- comparison_summary_tbl %>%
mutate(
lambda_segment = case_when(
(customer_lambda > 0.00 & customer_lambda <= 0.30) ~ "SEG1",
(customer_lambda > 0.30 & customer_lambda <= 0.60) ~ "SEG2",
(customer_lambda > 0.60 & customer_lambda <= 0.90) ~ "SEG3",
(customer_lambda > 0.90 & customer_lambda <= 1.20) ~ "SEG4",
(customer_lambda > 1.20) ~ "SEG5",
TRUE ~ "ERROR"
)
) %>%
group_nest(lambda_segment) %>%
mutate(
sample_data = map(data, ~ .x %>% group_nest(customer_id) %>% slice_sample(n = 10))
) %>%
select(lambda_segment, sample_data) %>%
unnest(sample_data) %>%
unnest(data)
ggplot(plotdata_tbl %>% mutate(customer_id = str_trunc(customer_id, width = 7))) +
geom_errorbar(
aes(x = customer_id, ymin = q10, ymax = q90, colour = label),
position = position_dodge(width = 0.75), width = 0, size = 1) +
geom_errorbar(
aes(x = customer_id, ymin = q25, ymax = q75, colour = label),
position = position_dodge(width = 0.75), width = 0, size = 2) +
geom_point(
aes(x = customer_id, y = customer_lambda)) +
facet_wrap(vars(lambda_segment), scales = "free") +
labs(
x = "Customer ID",
y = "Frequency Rate",
colour = "Distribution"
) +
theme(
axis.text.x = element_text(angle = 90, vjust = 0.5, size = 6),
legend.position = "bottom"
)
We also want to compare the two distributions in terms of posterior coverage.
coverage_stats_tbl <- comparison_summary_tbl %>%
mutate(
up_50 = customer_lambda > q75,  # true value above the 50% interval
dn_50 = customer_lambda < q25,  # true value below the 50% interval
up_80 = customer_lambda > q90,  # true value above the 80% interval
dn_80 = customer_lambda < q10,  # true value below the 80% interval
cover_50 = (!up_50 & !dn_50),
cover_80 = (!up_80 & !dn_80)
)
coverage_stats_tbl %>%
group_by(label) %>%
summarise(
.groups = "drop",
prop_50 = sum(cover_50) / n(),
prop_80 = sum(cover_80) / n(),
prop_up50 = sum(up_50) / n(),
prop_dn50 = sum(dn_50) / n(),
prop_up80 = sum(up_80) / n(),
prop_dn80 = sum(dn_80) / n()
)## # A tibble: 2 × 7
## label prop_50 prop_80 prop_up50 prop_dn50 prop_up80 prop_dn80
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 Flat 0.424 0.711 0.107 0.469 0.032 0.257
## 2 Prior 0.466 0.78 0.21 0.324 0.08 0.14
4.4 Fitting with an Alternative Prior
We fit the model again with a narrower prior on lambda to investigate its
effect.
stan_modelname <- "freqmodel_prior_2"
stanfit_prefix <- str_c("fit_", stan_modelname)
stan_data_lst <- fit_1000_data_tbl %>%
transmute(customer_id, btyd_count, tnx_weeks) %>%
compose_data(
r = 1,
alpha = 4
)
freqmodel_prior_2_stanfit <- freqmodel_prior_stanmodel$sample(
data = stan_data_lst,
chains = 4,
iter_warmup = 500,
iter_sampling = 500,
seed = 4205,
output_dir = stan_modeldir,
output_basename = stanfit_prefix
)## Running MCMC with 4 parallel chains...
##
## Chain 1 Iteration: 1 / 1000 [ 0%] (Warmup)
## Chain 2 Iteration: 1 / 1000 [ 0%] (Warmup)
## Chain 3 Iteration: 1 / 1000 [ 0%] (Warmup)
## Chain 4 Iteration: 1 / 1000 [ 0%] (Warmup)
## ...
## Chain 4 finished in 5.9 seconds.
## Chain 3 finished in 6.2 seconds.
## Chain 2 finished in 6.3 seconds.
## Chain 1 finished in 6.5 seconds.
##
## All 4 chains finished successfully.
## Mean chain execution time: 6.2 seconds.
## Total execution time: 6.7 seconds.
freqmodel_prior_2_stanfit$summary()## # A tibble: 2,001 × 10
## variable mean median sd mad q5 q95 rhat ess_bulk
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 lp__ -5.31e+3 -5.31e+3 22.8 23.5 -5.35e+3 -5.28e+3 1.00 714.
## 2 lambda[1] 1.13e-1 9.55e-2 0.0777 0.0687 2.08e-2 2.72e-1 0.999 2307.
## 3 lambda[2] 4.01e-1 3.59e-1 0.234 0.208 1.14e-1 8.43e-1 1.00 2189.
## 4 lambda[3] 9.05e-2 8.23e-2 0.0514 0.0475 2.37e-2 1.88e-1 1.00 2112.
## 5 lambda[4] 5.17e-1 5.10e-1 0.0946 0.0931 3.70e-1 6.83e-1 1.00 1590.
## 6 lambda[5] 2.00e-1 1.94e-1 0.0609 0.0608 1.11e-1 3.06e-1 1.00 2295.
## 7 lambda[6] 2.36e-1 2.31e-1 0.0661 0.0632 1.39e-1 3.58e-1 1.00 2296.
## 8 lambda[7] 4.81e-1 4.75e-1 0.0925 0.0927 3.44e-1 6.45e-1 1.00 2226.
## 9 lambda[8] 3.06e-1 3.02e-1 0.0746 0.0744 1.95e-1 4.35e-1 1.00 2761.
## 10 lambda[9] 7.37e-2 6.61e-2 0.0388 0.0332 2.50e-2 1.48e-1 1.00 2379.
## # … with 1,991 more rows, and 1 more variable: ess_tail <dbl>
Having fitted our model, we now run our HMC diagnostics.
freqmodel_prior_2_stanfit$cmdstan_diagnose()## Processing csv files: /home/rstudio/workshop/stan_models/fit_freqmodel_prior_2-1.csv, /home/rstudio/workshop/stan_models/fit_freqmodel_prior_2-2.csv, /home/rstudio/workshop/stan_models/fit_freqmodel_prior_2-3.csv, /home/rstudio/workshop/stan_models/fit_freqmodel_prior_2-4.csv
##
## Checking sampler transitions treedepth.
## Treedepth satisfactory for all transitions.
##
## Checking sampler transitions for divergences.
## No divergent transitions found.
##
## Checking E-BFMI - sampler transitions HMC potential energy.
## E-BFMI satisfactory.
##
## Effective sample size satisfactory.
##
## Split R-hat values satisfactory all parameters.
##
## Processing complete, no problems detected.
We repeat our validation exercise from before, comparing the outputs of the
posterior distribution for each lambda against the known true value.
freqmodel_prior_2_validation_tbl <- freqmodel_prior_2_stanfit %>%
recover_types(fit_1000_data_tbl) %>%
spread_draws(lambda[customer_id]) %>%
ungroup() %>%
inner_join(customer_simparams_tbl, by = "customer_id") %>%
select(customer_id, draw_id = .draw, post_lambda = lambda, customer_lambda)
freqmodel_prior_2_validation_tbl %>% glimpse()## Rows: 2,000,000
## Columns: 4
## $ customer_id <chr> "00JP8DATDZ", "00JP8DATDZ", "00JP8DATDZ", "00JP8DATDZ"…
## $ draw_id <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16,…
## $ post_lambda <dbl> 0.1472300, 0.0445552, 0.2295150, 0.0672781, 0.1772340,…
## $ customer_lambda <dbl> 0.01822481, 0.01822481, 0.01822481, 0.01822481, 0.0182…
We now calculate the q-values of each parameter against the posterior distribution and plot a histogram of these q-values.
plotvalid_tbl <- freqmodel_prior_2_validation_tbl %>%
calculate_distribution_qvals(post_lambda, customer_lambda)
ggplot(plotvalid_tbl) +
geom_histogram(aes(x = q_val), bins = 50) +
geom_hline(aes(yintercept = plotvalid_tbl %>% nrow() %>% divide_by(50)), colour = "red") +
labs(
x = "Quantile",
y = "Count",
title = "Quantile Plot of the q-Values for the Posterior Distribution"
)
4.5 Construct the Comparison Plot
Once again, we construct the data needed to plot the comparisons.
comparison_distrib_tbl <- list(
`Flat` = freqmodel_flat_obs_validation_tbl,
`Prior 1` = freqmodel_prior_validation_tbl,
`Prior 2` = freqmodel_prior_2_validation_tbl
) %>%
bind_rows(.id = "label")
comparison_distrib_tbl %>% glimpse()## Rows: 6,000,000
## Columns: 5
## $ label <chr> "Flat", "Flat", "Flat", "Flat", "Flat", "Flat", "Flat"…
## $ customer_id <chr> "00JP8DATDZ", "00JP8DATDZ", "00JP8DATDZ", "00JP8DATDZ"…
## $ draw_id <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16,…
## $ post_lambda <dbl> 0.2040730, 0.0686289, 0.1629100, 0.3354840, 0.0417703,…
## $ customer_lambda <dbl> 0.01822481, 0.01822481, 0.01822481, 0.01822481, 0.0182…
We can now compare the outputs of the three posterior distributions: the uniform prior and the two Gamma priors.
comparison_summary_tbl <- comparison_distrib_tbl %>%
group_by(label, customer_id) %>%
summarise(
.groups = "drop",
labels = c("q10", "q25", "q50", "q75", "q90", "mean", "customer_lambda"),
vals = c(quantile(post_lambda, probs = c(0.1, 0.25, 0.50, 0.75, 0.90)),
mean(post_lambda),
unique(customer_lambda))
) %>%
pivot_wider(
names_from = labels,
values_from = vals
)
comparison_summary_tbl %>% glimpse()## Rows: 3,000
## Columns: 9
## $ label <chr> "Flat", "Flat", "Flat", "Flat", "Flat", "Flat", "Flat"…
## $ customer_id <chr> "00JP8DATDZ", "020O4TLBUO", "02AIZVL9MO", "03V7QNX3MY"…
## $ q10 <dbl> 0.036296910, 0.337456100, 0.035923530, 0.428432700, 0.…
## $ q25 <dbl> 0.069737625, 0.529555250, 0.057754575, 0.485222750, 0.…
## $ q50 <dbl> 0.12261000, 0.81107350, 0.08986235, 0.55071450, 0.2123…
## $ q75 <dbl> 0.18882625, 1.16598250, 0.13248400, 0.62804075, 0.2571…
## $ q90 <dbl> 0.27889410, 1.59615500, 0.18247880, 0.69566790, 0.3052…
## $ mean <dbl> 0.14498034, 0.90473102, 0.10181685, 0.55933882, 0.2186…
## $ customer_lambda <dbl> 0.018224807, 0.026671151, 0.018550612, 0.649189558, 0.…
Now that we have summary statistics for the three distributions, we segment
our data into groups by the known values of lambda and then plot the range of
values for each distribution.
plotdata_tbl <- comparison_summary_tbl %>%
mutate(
lambda_segment = case_when(
(customer_lambda > 0.00 & customer_lambda <= 0.30) ~ "SEG1",
(customer_lambda > 0.30 & customer_lambda <= 0.60) ~ "SEG2",
(customer_lambda > 0.60 & customer_lambda <= 0.90) ~ "SEG3",
(customer_lambda > 0.90 & customer_lambda <= 1.20) ~ "SEG4",
(customer_lambda > 1.20) ~ "SEG5",
TRUE ~ "ERROR"
)
) %>%
group_nest(lambda_segment) %>%
mutate(
sample_data = map(data, ~ .x %>% group_nest(customer_id) %>% slice_sample(n = 10))
) %>%
select(lambda_segment, sample_data) %>%
unnest(sample_data) %>%
unnest(data)
ggplot(plotdata_tbl %>% mutate(customer_id = str_trunc(customer_id, width = 7))) +
geom_errorbar(
aes(x = customer_id, ymin = q10, ymax = q90, colour = label),
position = position_dodge(width = 0.75), width = 0, size = 0.5) +
geom_errorbar(
aes(x = customer_id, ymin = q25, ymax = q75, colour = label),
position = position_dodge(width = 0.75), width = 0, size = 1) +
geom_point(
aes(x = customer_id, y = customer_lambda)) +
facet_wrap(vars(lambda_segment), scales = "free") +
labs(
x = "Customer ID",
y = "Frequency Rate",
colour = "Distribution"
) +
theme(
axis.text.x = element_text(angle = 90, vjust = 0.5, size = 6),
legend.position = "bottom"
)
We also want to compare the three distributions in terms of posterior coverage.
coverage_stats_tbl <- comparison_summary_tbl %>%
mutate(
up_50 = customer_lambda > q75,  # true value above the 50% interval
dn_50 = customer_lambda < q25,  # true value below the 50% interval
up_80 = customer_lambda > q90,  # true value above the 80% interval
dn_80 = customer_lambda < q10,  # true value below the 80% interval
cover_50 = (!up_50 & !dn_50),
cover_80 = (!up_80 & !dn_80)
)
coverage_stats_tbl %>%
group_by(label) %>%
summarise(
.groups = "drop",
prop_50 = sum(cover_50) / n(),
prop_80 = sum(cover_80) / n(),
prop_up50 = sum(up_50) / n(),
prop_dn50 = sum(dn_50) / n(),
prop_up80 = sum(up_80) / n(),
prop_dn80 = sum(dn_80) / n()
)## # A tibble: 3 × 7
## label prop_50 prop_80 prop_up50 prop_dn50 prop_up80 prop_dn80
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 Flat 0.424 0.711 0.107 0.469 0.032 0.257
## 2 Prior 1 0.466 0.78 0.21 0.324 0.08 0.14
## 3 Prior 2 0.462 0.757 0.159 0.379 0.058 0.185
5 Construct Hierarchical Frequency Model
In our previous models we fixed the parameters for our Gamma hyper-distribution and passed those values in as ‘data’ to the model.
While this worked from a fitting perspective, it has the drawback of requiring strong prior knowledge of what those parameter values should be; a more Bayesian approach instead captures this uncertainty by putting distributions on those parameters.
This raises a further concern when fitting probability distributions: the parameters are often strongly co-dependent, because many distributions, unlike some of the more common ones, are not naturally parametrised in terms of a location.
Instead, we often fit distributions using a location and a dispersion, and then convert these back to the standard parametrisation.
For the Gamma distribution, we can fit a location parameter \(\mu\) and the coefficient of variation \(c_v\) (the standard deviation divided by the mean). These then convert to the shape parameter \(r\) and rate parameter \(\alpha\) using the following relationship:
\[\begin{eqnarray*} r &=& \frac{1}{c_v^2} \\ \alpha &=& \frac{1}{\mu c_v^2} \end{eqnarray*}\]
We do this in Stan by using the transformed parameters block to convert
\((\mu, c_v)\) into \((r, \alpha)\), so we can use those when fitting the model.
Alternatively, we could parameterise the Gamma with the mean \(\mu\) and the shape parameter \(r\), calculating the rate parameter \(\alpha\) as
\[ \alpha = \frac{r}{\mu} \]
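As a quick numerical sanity check on these conversions (a standalone sketch with arbitrary illustrative values of \(\mu\) and \(c_v\)):

```r
set.seed(1354)

mu <- 0.25   # target mean
cv <- 0.90   # target coefficient of variation

# mean / coefficient-of-variation parametrisation
r     <- 1 / cv^2
alpha <- 1 / (mu * cv^2)

# sample from Gamma(r, alpha) and check we recover the targets
x <- rgamma(1e6, shape = r, rate = alpha)
c(mean = mean(x), cv = sd(x) / mean(x))  # both close to (0.25, 0.90)

# mean / shape parametrisation gives the same rate: alpha = r / mu
all.equal(alpha, r / mu)
```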
## data {
## int<lower=1> n; // count of observations
##
## array[n] int<lower=0> btyd_count; // observed number of transactions
## vector<lower=0>[n] tnx_weeks; // observed time-period of transactions
##
## real<lower=0> mean_p1, mean_p2;
## real<lower=0> cov_p1, cov_p2;
## }
##
## parameters {
## real<lower=0> hier_mean;
## real<lower=0> hier_cov;
##
## vector<lower=0>[n] lambda;
## }
##
## transformed parameters {
## // shape <- 1 / (cv^2)
## // rate  <- 1 / (mu * cv^2)
##
## real<lower=0> r = 1 / (hier_cov * hier_cov);
## real<lower=0> alpha = 1 / (hier_mean * hier_cov * hier_cov);
## }
##
##
## model {
## hier_mean ~ gamma(mean_p1, mean_p2);
## hier_cov ~ gamma(cov_p1, cov_p2);
##
## lambda ~ gamma(r, alpha);
##
## target += poisson_lpmf(btyd_count | lambda .* tnx_weeks);
## }
##
## generated quantities {
## vector[n] log_lik;
##
## for (i in 1:n) {
## log_lik[i] = poisson_lpmf(btyd_count[i] | lambda[i] * tnx_weeks[i]);
## }
## }
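The generative process this model encodes can be simulated directly in R as a prior predictive check. The sketch below uses the hyperparameter values we pass to the sampler later in this section (\(\mu \sim \Gamma(1, 3)\), \(c_v \sim \Gamma(2, 2)\)) and an illustrative one-year observation window:

```r
set.seed(1402)

n_sims    <- 100000
tnx_weeks <- 52

# hyperpriors: mu ~ Gamma(1, 3), cv ~ Gamma(2, 2)
hier_mean <- rgamma(n_sims, shape = 1, rate = 3)
hier_cov  <- rgamma(n_sims, shape = 2, rate = 2)

# transformed-parameters step: (mu, cv) -> (r, alpha)
r     <- 1 / hier_cov^2
alpha <- 1 / (hier_mean * hier_cov^2)

# per-customer weekly rate and implied one-year transaction count
lambda    <- rgamma(n_sims, shape = r, rate = alpha)
sim_count <- rpois(n_sims, lambda * tnx_weeks)

quantile(sim_count, probs = c(0.1, 0.5, 0.9))
```

If the simulated counts look wildly unlike plausible customer behaviour, that is a signal to revisit the hyperprior choices below.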
We now need to think about the parameters for the hierarchical priors, so once again we visualise some candidate distributions as we did before.
x_vals <- seq(0, 1.5, by = 0.01)
y_vals <- dgamma(x_vals, shape = 1, rate = 3)
ggplot() +
geom_line(aes(x = x_vals, y = y_vals)) +
labs(
x = "mu",
y = "Density",
title = "Distribution for Prior on mu"
)This gives us a good spread of values and so should work fine.
We do a similar check for the spread of values of the coefficient of variation, which we estimate to be around 1.
x_vals <- seq(0, 3, by = 0.01)
y_vals <- dgamma(x_vals, shape = 2, rate = 2)
ggplot() +
geom_line(aes(x = x_vals, y = y_vals)) +
labs(
x = "Coefficient of Variation",
y = "Density",
title = "Distribution for Prior on c_v"
)5.1 Compile and Fit First Hierarchical Model
We first need to compile this model.
freqmodel_hier_stanmodel <- cmdstan_model(
"stan_code/freqmodel_hier.stan",
include_paths = stan_codedir,
pedantic = TRUE,
dir = stan_modeldir
)We then fit the compiled model to our data.
stan_modelname <- "freqmodel_hier"
stanfit_prefix <- str_c("fit_", stan_modelname)
stan_data_lst <- fit_1000_data_tbl %>%
transmute(customer_id, btyd_count, tnx_weeks) %>%
compose_data(
mean_p1 = 1,
mean_p2 = 3,
cov_p1 = 2,
cov_p2 = 2
)
freqmodel_hier_stanfit <- freqmodel_hier_stanmodel$sample(
data = stan_data_lst,
chains = 4,
iter_warmup = 500,
iter_sampling = 500,
seed = 4206,
output_dir = stan_modeldir,
output_basename = stanfit_prefix
)## Running MCMC with 4 parallel chains...
##
## Chain 1 Iteration: 1 / 1000 [ 0%] (Warmup)
## Chain 2 Iteration: 1 / 1000 [ 0%] (Warmup)
## Chain 3 Iteration: 1 / 1000 [ 0%] (Warmup)
## Chain 4 Iteration: 1 / 1000 [ 0%] (Warmup)
## Chain 2 Iteration: 100 / 1000 [ 10%] (Warmup)
## Chain 3 Iteration: 100 / 1000 [ 10%] (Warmup)
## Chain 4 Iteration: 100 / 1000 [ 10%] (Warmup)
## Chain 1 Iteration: 100 / 1000 [ 10%] (Warmup)
## Chain 2 Iteration: 200 / 1000 [ 20%] (Warmup)
## Chain 3 Iteration: 200 / 1000 [ 20%] (Warmup)
## Chain 4 Iteration: 200 / 1000 [ 20%] (Warmup)
## Chain 2 Iteration: 300 / 1000 [ 30%] (Warmup)
## Chain 1 Iteration: 200 / 1000 [ 20%] (Warmup)
## Chain 3 Iteration: 300 / 1000 [ 30%] (Warmup)
## Chain 4 Iteration: 300 / 1000 [ 30%] (Warmup)
## Chain 2 Iteration: 400 / 1000 [ 40%] (Warmup)
## Chain 1 Iteration: 300 / 1000 [ 30%] (Warmup)
## Chain 3 Iteration: 400 / 1000 [ 40%] (Warmup)
## Chain 4 Iteration: 400 / 1000 [ 40%] (Warmup)
## Chain 2 Iteration: 500 / 1000 [ 50%] (Warmup)
## Chain 2 Iteration: 501 / 1000 [ 50%] (Sampling)
## Chain 1 Iteration: 400 / 1000 [ 40%] (Warmup)
## Chain 3 Iteration: 500 / 1000 [ 50%] (Warmup)
## Chain 3 Iteration: 501 / 1000 [ 50%] (Sampling)
## Chain 4 Iteration: 500 / 1000 [ 50%] (Warmup)
## Chain 4 Iteration: 501 / 1000 [ 50%] (Sampling)
## Chain 1 Iteration: 500 / 1000 [ 50%] (Warmup)
## Chain 1 Iteration: 501 / 1000 [ 50%] (Sampling)
## Chain 2 Iteration: 600 / 1000 [ 60%] (Sampling)
## Chain 3 Iteration: 600 / 1000 [ 60%] (Sampling)
## Chain 4 Iteration: 600 / 1000 [ 60%] (Sampling)
## Chain 1 Iteration: 600 / 1000 [ 60%] (Sampling)
## Chain 3 Iteration: 700 / 1000 [ 70%] (Sampling)
## Chain 4 Iteration: 700 / 1000 [ 70%] (Sampling)
## Chain 2 Iteration: 700 / 1000 [ 70%] (Sampling)
## Chain 1 Iteration: 700 / 1000 [ 70%] (Sampling)
## Chain 3 Iteration: 800 / 1000 [ 80%] (Sampling)
## Chain 4 Iteration: 800 / 1000 [ 80%] (Sampling)
## Chain 1 Iteration: 800 / 1000 [ 80%] (Sampling)
## Chain 3 Iteration: 900 / 1000 [ 90%] (Sampling)
## Chain 4 Iteration: 900 / 1000 [ 90%] (Sampling)
## Chain 2 Iteration: 800 / 1000 [ 80%] (Sampling)
## Chain 1 Iteration: 900 / 1000 [ 90%] (Sampling)
## Chain 3 Iteration: 1000 / 1000 [100%] (Sampling)
## Chain 4 Iteration: 1000 / 1000 [100%] (Sampling)
## Chain 3 finished in 8.2 seconds.
## Chain 4 finished in 8.1 seconds.
## Chain 2 Iteration: 900 / 1000 [ 90%] (Sampling)
## Chain 1 Iteration: 1000 / 1000 [100%] (Sampling)
## Chain 1 finished in 8.8 seconds.
## Chain 2 Iteration: 1000 / 1000 [100%] (Sampling)
## Chain 2 finished in 9.1 seconds.
##
## All 4 chains finished successfully.
## Mean chain execution time: 8.5 seconds.
## Total execution time: 9.4 seconds.
freqmodel_hier_stanfit$summary()## # A tibble: 2,005 × 10
## variable mean median sd mad q5 q95 rhat ess_bulk
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 lp__ -3.90e+3 -3.90e+3 2.47e+1 2.47e+1 -3.94e+3 -3.86e+3 1.01 521.
## 2 hier_mean 2.84e-1 2.84e-1 8.80e-3 8.77e-3 2.70e-1 2.99e-1 1.00 2771.
## 3 hier_cov 9.31e-1 9.31e-1 2.25e-2 2.29e-2 8.93e-1 9.67e-1 1.00 1361.
## 4 lambda[1] 1.21e-1 1.00e-1 8.44e-2 7.35e-2 2.26e-2 2.92e-1 1.00 1744.
## 5 lambda[2] 4.19e-1 3.75e-1 2.36e-1 2.12e-1 1.27e-1 8.57e-1 0.999 1911.
## 6 lambda[3] 9.49e-2 8.41e-2 5.40e-2 4.78e-2 2.94e-2 1.95e-1 1.00 2348.
## 7 lambda[4] 5.17e-1 5.08e-1 9.76e-2 9.51e-2 3.65e-1 6.85e-1 1.00 2699.
## 8 lambda[5] 2.04e-1 1.98e-1 6.15e-2 5.69e-2 1.13e-1 3.18e-1 1.01 2925.
## 9 lambda[6] 2.39e-1 2.32e-1 7.01e-2 6.80e-2 1.39e-1 3.61e-1 1.00 2218.
## 10 lambda[7] 4.87e-1 4.80e-1 9.36e-2 8.99e-2 3.41e-1 6.50e-1 1.00 1977.
## # … with 1,995 more rows, and 1 more variable: ess_tail <dbl>
We also want to check the various HMC diagnostics.
freqmodel_hier_stanfit$cmdstan_diagnose()## Processing csv files: /home/rstudio/workshop/stan_models/fit_freqmodel_hier-1.csv, /home/rstudio/workshop/stan_models/fit_freqmodel_hier-2.csv, /home/rstudio/workshop/stan_models/fit_freqmodel_hier-3.csv, /home/rstudio/workshop/stan_models/fit_freqmodel_hier-4.csv
##
## Checking sampler transitions treedepth.
## Treedepth satisfactory for all transitions.
##
## Checking sampler transitions for divergences.
## No divergent transitions found.
##
## Checking E-BFMI - sampler transitions HMC potential energy.
## E-BFMI satisfactory.
##
## Effective sample size satisfactory.
##
## Split R-hat values satisfactory all parameters.
##
## Processing complete, no problems detected.
The HMC diagnostics seem to check out, so before we look at individual values
for lambda we want to check the posterior distributions for the hierarchical
priors.
posterior_params_tbl <- freqmodel_hier_stanfit %>%
tidy_draws() %>%
select(.draw, hier_mean, hier_cov, r, alpha) %>%
pivot_longer(
cols = !.draw,
names_to = "parameter",
values_to = "value"
)
ggplot(posterior_params_tbl) +
geom_histogram(aes(x = value), bins = 50) +
facet_wrap(vars(parameter), ncol = 2, scales = "free_x") +
labs(
x = "Parameter Value",
y = "Count",
title = "Histograms of the Hierarchical Posteriors"
)5.2 Perform Basic Model Validation
As before, we perform some cursory model validation based on the data we have.
freqmodel_hier_validation_tbl <- freqmodel_hier_stanfit %>%
recover_types(fit_1000_data_tbl) %>%
spread_draws(lambda[customer_id]) %>%
ungroup() %>%
inner_join(customer_simparams_tbl, by = "customer_id") %>%
select(customer_id, draw_id = .draw, post_lambda = lambda, customer_lambda)
freqmodel_hier_validation_tbl %>% glimpse()## Rows: 2,000,000
## Columns: 4
## $ customer_id <chr> "00JP8DATDZ", "00JP8DATDZ", "00JP8DATDZ", "00JP8DATDZ"…
## $ draw_id <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16,…
## $ post_lambda <dbl> 0.0809477, 0.1025150, 0.1096440, 0.0309810, 0.3111300,…
## $ customer_lambda <dbl> 0.01822481, 0.01822481, 0.01822481, 0.01822481, 0.0182…
We now calculate the q-values of each parameter against the posterior distribution and plot a histogram of these q-values.
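The helper `calculate_distribution_qvals` is defined elsewhere in this workbook; as a reminder of the underlying idea (a standalone toy, not the actual helper), the q-value of a known parameter is the share of posterior draws falling at or below it, and for a calibrated posterior these q-values are approximately Uniform(0, 1):

```r
set.seed(1568)

n_cust  <- 2000
n_draws <- 1000

# toy posteriors: truth and draws come from the same Gamma distribution,
# mimicking a perfectly calibrated fit
true_lambda <- rgamma(n_cust, shape = 2, rate = 8)

q_val <- vapply(
  true_lambda,
  function(tl) mean(rgamma(n_draws, shape = 2, rate = 8) <= tl),
  numeric(1)
)

# should look roughly flat between 0 and 1
hist(q_val, breaks = 50)
```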
plotvalid_tbl <- freqmodel_hier_validation_tbl %>%
calculate_distribution_qvals(post_lambda, customer_lambda)
ggplot(plotvalid_tbl) +
geom_histogram(aes(x = q_val), bins = 50) +
geom_hline(aes(yintercept = plotvalid_tbl %>% nrow() %>% divide_by(50)), colour = "red") +
labs(
x = "Quantile",
y = "Count",
title = "Quantile Plot of the q-Values for the Posterior Distribution"
)6 Validating the Frequency Models
We now turn our attention to validating our model fits.
For this exercise, we fit our models to the dataset of 10,000 customers over part of the time period, then use each model to predict the transaction count for the remainder of the period. This should help us assess the quality of our models.
We fit our models on the data from the first nine months of the year.
break_date <- as.Date("2019-09-30")
full_weeks <- difftime(break_date, as.Date("2019-01-01"), units = "weeks") %>%
as.numeric()
training_tbl <- customer_transactions_tbl %>%
filter(
customer_id %in% id_10000,
tnx_timestamp <= break_date
)
training_tbl %>% glimpse()## Rows: 107,957
## Columns: 5
## $ customer_id <chr> "0010G0VY0J", "0010G0VY0J", "0010G0VY0J", "0010G0VY0J", …
## $ cohort_qtr <chr> "2019 Q1", "2019 Q1", "2019 Q1", "2019 Q1", "2019 Q1", "…
## $ cohort_ym <chr> "2019 01", "2019 01", "2019 01", "2019 01", "2019 01", "…
## $ tnx_timestamp <dttm> 2019-01-01 23:31:04, 2019-01-05 10:22:11, 2019-01-09 22…
## $ tnx_amount <dbl> 76.52, 85.76, 72.59, 69.78, 73.71, 83.74, 2807.96, 3221.…
test_tbl <- customer_transactions_tbl %>%
filter(
customer_id %in% id_10000,
tnx_timestamp > break_date
)
test_tbl %>% glimpse()## Rows: 33,408
## Columns: 5
## $ customer_id <chr> "003L6XXKCJ", "003L6XXKCJ", "003L6XXKCJ", "003L6XXKCJ", …
## $ cohort_qtr <chr> "2019 Q1", "2019 Q1", "2019 Q1", "2019 Q1", "2019 Q1", "…
## $ cohort_ym <chr> "2019 01", "2019 01", "2019 01", "2019 01", "2019 01", "…
## $ tnx_timestamp <dttm> 2019-10-08 04:44:24, 2019-10-15 18:55:38, 2019-10-18 07…
## $ tnx_amount <dbl> 444.42, 480.46, 482.32, 432.14, 454.97, 386.85, 463.11, …
We now need to reconstruct the summary statistics for the training data.
training_stats_tbl <- training_tbl %>%
calculate_transaction_summary_stats() %>%
transmute(
customer_id, tnx_weeks = full_weeks, btyd_count
)
training_stats_tbl %>% glimpse()## Rows: 10,000
## Columns: 3
## $ customer_id <chr> "0010G0VY0J", "001YCS74VX", "003L6XXKCJ", "00G07U1V2X", "0…
## $ tnx_weeks <dbl> 38.85714, 38.85714, 38.85714, 38.85714, 38.85714, 38.85714…
## $ btyd_count <dbl> 5, 5, 58, 0, 7, 1, 31, 18, 2, 3, 14, 17, 22, 9, 0, 5, 11, …
test_stats_tbl <- test_tbl %>% calculate_transaction_summary_stats()
test_stats_tbl %>% glimpse()## Rows: 7,685
## Columns: 9
## $ customer_id <chr> "003L6XXKCJ", "00PBR5EPJ3", "00TEMZ78N2", "00XJ53FG5R", "…
## $ tnx_count <int> 18, 16, 8, 2, 2, 9, 7, 3, 1, 4, 4, 12, 4, 3, 3, 12, 12, 4…
## $ first_tnx_ts <dttm> 2019-10-08 04:44:24, 2019-10-06 18:28:48, 2019-10-06 15:…
## $ last_tnx_ts <dttm> 2019-12-19 06:28:23, 2019-12-28 15:46:58, 2019-12-23 00:…
## $ btyd_count <dbl> 17, 15, 7, 1, 1, 8, 6, 2, 0, 3, 3, 11, 3, 2, 2, 11, 11, 3…
## $ all_weeks <dbl> 52, 52, 52, 52, 52, 52, 52, 52, 52, 52, 52, 52, 52, 52, 5…
## $ tnx_weeks <dbl> 10.296030, 11.841087, 11.048979, 10.288746, 6.642898, 8.3…
## $ obs_freq <dbl> 1.65112178, 1.26677554, 0.63354270, 0.09719357, 0.1505367…
## $ emp_freq <dbl> 0.32692308, 0.28846154, 0.13461538, 0.01923077, 0.0192307…
6.1 Validating the Flat Model Fit
Now that we have set up our datasets we fit our model on the first nine months of the year.
This is the exact same model as before, so we do not need to recompile the Stan model; we simply fit it to the subsetted data.
stan_modelname <- "freqmodel_flat_subset"
stanfit_prefix <- str_c("fit_", stan_modelname)
stan_data_lst <- training_stats_tbl %>%
compose_data()
freqmodel_flat_subset_stanfit <- freqmodel_flat_stanmodel$sample(
data = stan_data_lst,
chains = 4,
iter_warmup = 500,
iter_sampling = 500,
seed = 4207,
output_dir = stan_modeldir,
output_basename = stanfit_prefix
)## Running MCMC with 4 parallel chains...
##
## Chain 1 Iteration: 1 / 1000 [ 0%] (Warmup)
## Chain 2 Iteration: 1 / 1000 [ 0%] (Warmup)
## Chain 3 Iteration: 1 / 1000 [ 0%] (Warmup)
## Chain 4 Iteration: 1 / 1000 [ 0%] (Warmup)
## Chain 3 Iteration: 100 / 1000 [ 10%] (Warmup)
## Chain 1 Iteration: 100 / 1000 [ 10%] (Warmup)
## Chain 2 Iteration: 100 / 1000 [ 10%] (Warmup)
## Chain 4 Iteration: 100 / 1000 [ 10%] (Warmup)
## Chain 3 Iteration: 200 / 1000 [ 20%] (Warmup)
## Chain 1 Iteration: 200 / 1000 [ 20%] (Warmup)
## Chain 4 Iteration: 200 / 1000 [ 20%] (Warmup)
## Chain 2 Iteration: 200 / 1000 [ 20%] (Warmup)
## Chain 3 Iteration: 300 / 1000 [ 30%] (Warmup)
## Chain 1 Iteration: 300 / 1000 [ 30%] (Warmup)
## Chain 4 Iteration: 300 / 1000 [ 30%] (Warmup)
## Chain 2 Iteration: 300 / 1000 [ 30%] (Warmup)
## Chain 3 Iteration: 400 / 1000 [ 40%] (Warmup)
## Chain 4 Iteration: 400 / 1000 [ 40%] (Warmup)
## Chain 1 Iteration: 400 / 1000 [ 40%] (Warmup)
## Chain 2 Iteration: 400 / 1000 [ 40%] (Warmup)
## Chain 3 Iteration: 500 / 1000 [ 50%] (Warmup)
## Chain 3 Iteration: 501 / 1000 [ 50%] (Sampling)
## Chain 4 Iteration: 500 / 1000 [ 50%] (Warmup)
## Chain 4 Iteration: 501 / 1000 [ 50%] (Sampling)
## Chain 1 Iteration: 500 / 1000 [ 50%] (Warmup)
## Chain 1 Iteration: 501 / 1000 [ 50%] (Sampling)
## Chain 2 Iteration: 500 / 1000 [ 50%] (Warmup)
## Chain 2 Iteration: 501 / 1000 [ 50%] (Sampling)
## Chain 3 Iteration: 600 / 1000 [ 60%] (Sampling)
## Chain 4 Iteration: 600 / 1000 [ 60%] (Sampling)
## Chain 1 Iteration: 600 / 1000 [ 60%] (Sampling)
## Chain 2 Iteration: 600 / 1000 [ 60%] (Sampling)
## Chain 3 Iteration: 700 / 1000 [ 70%] (Sampling)
## Chain 4 Iteration: 700 / 1000 [ 70%] (Sampling)
## Chain 1 Iteration: 700 / 1000 [ 70%] (Sampling)
## Chain 2 Iteration: 700 / 1000 [ 70%] (Sampling)
## Chain 3 Iteration: 800 / 1000 [ 80%] (Sampling)
## Chain 1 Iteration: 800 / 1000 [ 80%] (Sampling)
## Chain 4 Iteration: 800 / 1000 [ 80%] (Sampling)
## Chain 2 Iteration: 800 / 1000 [ 80%] (Sampling)
## Chain 3 Iteration: 900 / 1000 [ 90%] (Sampling)
## Chain 4 Iteration: 900 / 1000 [ 90%] (Sampling)
## Chain 1 Iteration: 900 / 1000 [ 90%] (Sampling)
## Chain 2 Iteration: 900 / 1000 [ 90%] (Sampling)
## Chain 3 Iteration: 1000 / 1000 [100%] (Sampling)
## Chain 3 finished in 89.6 seconds.
## Chain 4 Iteration: 1000 / 1000 [100%] (Sampling)
## Chain 4 finished in 90.6 seconds.
## Chain 1 Iteration: 1000 / 1000 [100%] (Sampling)
## Chain 1 finished in 91.4 seconds.
## Chain 2 Iteration: 1000 / 1000 [100%] (Sampling)
## Chain 2 finished in 91.4 seconds.
##
## All 4 chains finished successfully.
## Mean chain execution time: 90.8 seconds.
## Total execution time: 91.7 seconds.
freqmodel_flat_subset_stanfit$summary()## # A tibble: 20,001 × 10
## variable mean median sd mad q5 q95 rhat ess_bulk
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 lp__ -4.15e+4 -4.15e+4 73.6 73.4 -4.16e+4 -4.14e+4 1.00 718.
## 2 lambda[1] 1.56e-1 1.47e-1 0.0645 0.0587 6.64e-2 2.73e-1 1.00 2300.
## 3 lambda[2] 1.53e-1 1.46e-1 0.0638 0.0618 6.39e-2 2.68e-1 1.00 3184.
## 4 lambda[3] 1.52e+0 1.52e+0 0.192 0.193 1.22e+0 1.85e+0 1.00 3267.
## 5 lambda[4] 2.59e-2 1.80e-2 0.0254 0.0188 1.38e-3 7.53e-2 0.999 2441.
## 6 lambda[5] 2.04e-1 1.96e-1 0.0701 0.0661 1.04e-1 3.30e-1 1.00 2694.
## 7 lambda[6] 4.98e-2 4.23e-2 0.0352 0.0299 8.66e-3 1.17e-1 1.00 2285.
## 8 lambda[7] 8.19e-1 8.11e-1 0.145 0.149 5.97e-1 1.07e+0 1.00 2655.
## 9 lambda[8] 4.86e-1 4.79e-1 0.111 0.109 3.22e-1 6.85e-1 1.00 2909.
## 10 lambda[9] 7.70e-2 6.83e-2 0.0447 0.0407 2.03e-2 1.63e-1 1.00 2439.
## # … with 19,991 more rows, and 1 more variable: ess_tail <dbl>
We need to check the HMC diagnostics for this fit.
freqmodel_flat_subset_stanfit$cmdstan_diagnose()## Processing csv files: /home/rstudio/workshop/stan_models/fit_freqmodel_flat_subset-1.csv, /home/rstudio/workshop/stan_models/fit_freqmodel_flat_subset-2.csv, /home/rstudio/workshop/stan_models/fit_freqmodel_flat_subset-3.csv, /home/rstudio/workshop/stan_models/fit_freqmodel_flat_subset-4.csv
##
## Checking sampler transitions treedepth.
## Treedepth satisfactory for all transitions.
##
## Checking sampler transitions for divergences.
## No divergent transitions found.
##
## Checking E-BFMI - sampler transitions HMC potential energy.
## E-BFMI satisfactory.
##
## Effective sample size satisfactory.
##
## Split R-hat values satisfactory all parameters.
##
## Processing complete, no problems detected.
We use our fitted model to estimate the transaction frequency for each customer, use this to simulate a range of transaction counts for each customer over the subsequent three months, and finally compare these to the observed counts.
tnx_window <- difftime(as.Date("2019-12-31"), as.Date("2019-10-01"), units = "weeks") %>%
as.numeric()
freqmodel_flat_subset_predict_tbl <- freqmodel_flat_subset_stanfit %>%
recover_types(training_stats_tbl) %>%
spread_draws(lambda[customer_id]) %>%
ungroup() %>%
mutate(
pred_count = rpois(n(), lambda = lambda * tnx_window)
)
freqmodel_flat_subset_predict_tbl %>% glimpse()## Rows: 20,000,000
## Columns: 6
## $ customer_id <chr> "0010G0VY0J", "0010G0VY0J", "0010G0VY0J", "0010G0VY0J", "0…
## $ lambda <dbl> 0.2839830, 0.1164000, 0.1570210, 0.1270920, 0.1842610, 0.2…
## $ .chain <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1…
## $ .iteration <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17,…
## $ .draw <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17,…
## $ pred_count <int> 3, 1, 1, 3, 2, 5, 0, 0, 2, 4, 6, 2, 2, 2, 0, 3, 1, 0, 1, 2…
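These per-customer predictive draws can be summarised into intervals before comparing them against the hold-out counts. A toy sketch for a single customer, using synthetic posterior draws (all names and values here are illustrative, not from the workbook):

```r
set.seed(1745)

tnx_window <- 13   # weeks in the hold-out period

# synthetic posterior draws of a weekly transaction rate for one customer
lambda_draws <- rgamma(2000, shape = 5, rate = 20)

# posterior predictive draws of the 13-week transaction count
pred_counts <- rpois(length(lambda_draws), lambda_draws * tnx_window)

# predictive interval for the hold-out count
quantile(pred_counts, probs = c(0.10, 0.25, 0.50, 0.75, 0.90))
```

The observed hold-out count can then be checked against intervals like these, customer by customer.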
As we did before, we now combine these predicted transaction counts with what was observed in the dataset and check whether the q-values give us a uniform distribution across customers.
freqmodel_flat_predict_valid_tbl <- freqmodel_flat_subset_predict_tbl %>%
inner_join(customer_simparams_tbl, by = "customer_id") %>%
left_join(test_stats_tbl, by = "customer_id") %>%
select(customer_id, draw_id = .draw, customer_lambda, lambda, pred_count, obs_count = btyd_count) %>%
replace_na(list(obs_count = 0))
freqmodel_flat_predict_valid_tbl %>% glimpse()## Rows: 20,000,000
## Columns: 6
## $ customer_id <chr> "0010G0VY0J", "0010G0VY0J", "0010G0VY0J", "0010G0VY0J"…
## $ draw_id <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16,…
## $ customer_lambda <dbl> 0.2207705, 0.2207705, 0.2207705, 0.2207705, 0.2207705,…
## $ lambda <dbl> 0.2839830, 0.1164000, 0.1570210, 0.1270920, 0.1842610,…
## $ pred_count <int> 3, 1, 1, 3, 2, 5, 0, 0, 2, 4, 6, 2, 2, 2, 0, 3, 1, 0, …
## $ obs_count <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
We now use this data to construct a histogram of the q-vals and compare that with what we would expect from a uniform distribution.
We first compare the posterior distribution of lambda with the
known underlying value we used to generate the data.
plotvalid_tbl <- freqmodel_flat_predict_valid_tbl %>%
calculate_distribution_qvals(lambda, customer_lambda, customer_id)
ggplot(plotvalid_tbl) +
geom_histogram(aes(x = q_val), bins = 50) +
geom_hline(aes(yintercept = plotvalid_tbl %>% nrow() %>% divide_by(50)), colour = "red") +
labs(
x = "Quantile",
y = "Count",
title = "Quantile Plot of the q-Values for Simulated Transaction Frequencies"
)We can also repeat this analysis for the observed transaction counts.
plotvalid_tbl <- freqmodel_flat_predict_valid_tbl %>%
calculate_distribution_qvals(pred_count, obs_count, customer_id, customer_lambda)
ggplot(plotvalid_tbl) +
geom_histogram(aes(x = q_val), bins = 50) +
geom_hline(aes(yintercept = plotvalid_tbl %>% nrow() %>% divide_by(50)), colour = "red") +
labs(
x = "Quantile",
y = "Count",
title = "Quantile Plot of the q-Values for Simulated Transaction Counts"
)As this work will be repeated for multiple models, we wrap all the above logic
into a function validate_frequency_model that performs the various steps
required to construct the validation data; we also need to set up some
supporting tables.
customer_lambda_tbl <- customer_simparams_tbl %>%
select(customer_id, customer_lambda)
obs_data_tbl <- test_stats_tbl %>%
select(customer_id, obs_count = tnx_count)
input_validation_tbl <- training_stats_tbl %>%
inner_join(customer_lambda_tbl, by = "customer_id") %>%
left_join(obs_data_tbl, by = "customer_id") %>%
replace_na(
list(obs_count = 0)
) %>%
transmute(
customer_id = customer_id,
customer_lambda = customer_lambda,
tnx_window = tnx_window,
obs_count = obs_count
)
input_validation_tbl %>% glimpse()## Rows: 10,000
## Columns: 4
## $ customer_id <chr> "0010G0VY0J", "001YCS74VX", "003L6XXKCJ", "00G07U1V2X"…
## $ customer_lambda <dbl> 0.220770473, 0.066357589, 1.447000773, 0.026214628, 0.…
## $ tnx_window <dbl> 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13…
## $ obs_count <int> 0, 0, 18, 0, 0, 0, 16, 8, 0, 2, 2, 9, 7, 3, 0, 1, 4, 4…
6.2 Validating Out-of-Sample Frequency Models
As we will repeat the above logic for each of the models constructed, the
function validate_frequency_model lets us construct this data and compare
multiple models at once.
We now refit each model on our ‘training’ data, as we did with the flat model
in the previous section.
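The actual validate_frequency_model function is not reproduced here; the following is a minimal base-R sketch of the shape such a function might take (the signature and simplifications are ours, e.g. ties in the discrete counts are ignored):

```r
# hypothetical sketch of validate_frequency_model: given long-format
# posterior draws of lambda (customer_id, lambda) and observed hold-out
# counts (customer_id, obs_count), simulate predictive counts over the
# hold-out window and return one q-value per customer
validate_frequency_model <- function(draws_tbl, obs_tbl, tnx_window) {
  merged <- merge(draws_tbl, obs_tbl, by = "customer_id")

  # one posterior predictive count per posterior draw
  merged$pred_count <- rpois(nrow(merged), merged$lambda * tnx_window)

  # q-value: share of predictive draws below the observed count
  q_val <- tapply(
    merged$pred_count < merged$obs_count,
    merged$customer_id,
    mean
  )

  data.frame(customer_id = names(q_val), q_val = as.numeric(q_val))
}
```

A histogram of `q_val` across customers then plays the same role as the quantile plots above.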
6.2.1 Gamma(0.25, 1) Prior
We refit the model using a Gamma prior with fixed parameters, starting with a \(\Gamma(0.25, 1)\) prior.
stan_modelname <- "freqmodel_prior_subset"
stanfit_prefix <- str_c("fit_", stan_modelname)
stan_data_lst <- training_stats_tbl %>%
compose_data(
r = 0.25,
alpha = 1
)
freqmodel_prior_subset_stanfit <- freqmodel_prior_stanmodel$sample(
data = stan_data_lst,
chains = 4,
iter_warmup = 500,
iter_sampling = 500,
seed = 4208,
output_dir = stan_modeldir,
output_basename = stanfit_prefix
)## Running MCMC with 4 parallel chains...
##
## Chain 1 Iteration: 1 / 1000 [ 0%] (Warmup)
## Chain 2 Iteration: 1 / 1000 [ 0%] (Warmup)
## Chain 3 Iteration: 1 / 1000 [ 0%] (Warmup)
## Chain 4 Iteration: 1 / 1000 [ 0%] (Warmup)
## Chain 4 Iteration: 100 / 1000 [ 10%] (Warmup)
## Chain 2 Iteration: 100 / 1000 [ 10%] (Warmup)
## Chain 1 Iteration: 100 / 1000 [ 10%] (Warmup)
## Chain 4 Iteration: 200 / 1000 [ 20%] (Warmup)
## Chain 3 Iteration: 100 / 1000 [ 10%] (Warmup)
## Chain 2 Iteration: 200 / 1000 [ 20%] (Warmup)
## Chain 1 Iteration: 200 / 1000 [ 20%] (Warmup)
## Chain 4 Iteration: 300 / 1000 [ 30%] (Warmup)
## Chain 3 Iteration: 200 / 1000 [ 20%] (Warmup)
## Chain 2 Iteration: 300 / 1000 [ 30%] (Warmup)
## Chain 4 Iteration: 400 / 1000 [ 40%] (Warmup)
## Chain 1 Iteration: 300 / 1000 [ 30%] (Warmup)
## Chain 2 Iteration: 400 / 1000 [ 40%] (Warmup)
## Chain 3 Iteration: 300 / 1000 [ 30%] (Warmup)
## Chain 1 Iteration: 400 / 1000 [ 40%] (Warmup)
## Chain 4 Iteration: 500 / 1000 [ 50%] (Warmup)
## Chain 4 Iteration: 501 / 1000 [ 50%] (Sampling)
## Chain 3 Iteration: 400 / 1000 [ 40%] (Warmup)
## Chain 2 Iteration: 500 / 1000 [ 50%] (Warmup)
## Chain 2 Iteration: 501 / 1000 [ 50%] (Sampling)
## Chain 1 Iteration: 500 / 1000 [ 50%] (Warmup)
## Chain 1 Iteration: 501 / 1000 [ 50%] (Sampling)
## Chain 4 Iteration: 600 / 1000 [ 60%] (Sampling)
## Chain 2 Iteration: 600 / 1000 [ 60%] (Sampling)
## Chain 3 Iteration: 500 / 1000 [ 50%] (Warmup)
## Chain 3 Iteration: 501 / 1000 [ 50%] (Sampling)
## Chain 1 Iteration: 600 / 1000 [ 60%] (Sampling)
## Chain 4 Iteration: 700 / 1000 [ 70%] (Sampling)
## Chain 3 Iteration: 600 / 1000 [ 60%] (Sampling)
## Chain 2 Iteration: 700 / 1000 [ 70%] (Sampling)
## Chain 1 Iteration: 700 / 1000 [ 70%] (Sampling)
## Chain 4 Iteration: 800 / 1000 [ 80%] (Sampling)
## Chain 3 Iteration: 700 / 1000 [ 70%] (Sampling)
## Chain 2 Iteration: 800 / 1000 [ 80%] (Sampling)
## Chain 4 Iteration: 900 / 1000 [ 90%] (Sampling)
## Chain 1 Iteration: 800 / 1000 [ 80%] (Sampling)
## Chain 3 Iteration: 800 / 1000 [ 80%] (Sampling)
## Chain 2 Iteration: 900 / 1000 [ 90%] (Sampling)
## Chain 4 Iteration: 1000 / 1000 [100%] (Sampling)
## Chain 4 finished in 204.4 seconds.
## Chain 1 Iteration: 900 / 1000 [ 90%] (Sampling)
## Chain 2 Iteration: 1000 / 1000 [100%] (Sampling)
## Chain 3 Iteration: 900 / 1000 [ 90%] (Sampling)
## Chain 2 finished in 211.0 seconds.
## Chain 1 Iteration: 1000 / 1000 [100%] (Sampling)
## Chain 1 finished in 216.7 seconds.
## Chain 3 Iteration: 1000 / 1000 [100%] (Sampling)
## Chain 3 finished in 219.7 seconds.
##
## All 4 chains finished successfully.
## Mean chain execution time: 212.9 seconds.
## Total execution time: 220.4 seconds.
freqmodel_prior_subset_stanfit$summary()## # A tibble: 20,001 × 10
## variable mean median sd mad q5 q95 rhat ess_bulk ess_tail
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 lp__ -30451. -30451. 76.2 73.3 -3.06e+4 -30326. 1.01 558. 916.
## 2 lambda[1] 0.131 0.123 0.0581 0.0552 5.16e-2 0.241 1.01 3678. 1274.
## 3 lambda[2] 0.132 0.125 0.0564 0.0529 5.53e-2 0.236 1.00 3681. 1403.
## 4 lambda[3] 1.47 1.46 0.186 0.189 1.18e+0 1.79 1.00 4466. 1715.
## 5 lambda[4] 0.00605 0.00110 0.0119 0.00164 1.32e-7 0.0275 1.00 1101. 827.
## 6 lambda[5] 0.184 0.175 0.0704 0.0669 8.80e-2 0.312 1.00 4067. 1152.
## 7 lambda[6] 0.0309 0.0221 0.0286 0.0205 2.44e-3 0.0874 1.00 2575. 1129.
## 8 lambda[7] 0.782 0.774 0.140 0.145 5.70e-1 1.02 1.01 3634. 1640.
## 9 lambda[8] 0.460 0.455 0.103 0.101 3.01e-1 0.641 1.00 3960. 1469.
## 10 lambda[9] 0.0562 0.0483 0.0378 0.0334 1.02e-2 0.129 1.00 3019. 1250.
## # … with 19,991 more rows
Having fit with this data, we need to check the HMC diagnostics.
freqmodel_prior_subset_stanfit$cmdstan_diagnose()## Processing csv files: /home/rstudio/workshop/stan_models/fit_freqmodel_prior_subset-1.csv, /home/rstudio/workshop/stan_models/fit_freqmodel_prior_subset-2.csv, /home/rstudio/workshop/stan_models/fit_freqmodel_prior_subset-3.csv, /home/rstudio/workshop/stan_models/fit_freqmodel_prior_subset-4.csv
##
## Checking sampler transitions treedepth.
## Treedepth satisfactory for all transitions.
##
## Checking sampler transitions for divergences.
## No divergent transitions found.
##
## Checking E-BFMI - sampler transitions HMC potential energy.
## E-BFMI satisfactory.
##
## Effective sample size satisfactory.
##
## Split R-hat values satisfactory all parameters.
##
## Processing complete, no problems detected.
6.2.2 Gamma(1, 4) Prior
We also refit the model with the less diffuse \(\Gamma(1, 4)\) prior.
stan_modelname <- "freqmodel_prior_2_subset"
stanfit_prefix <- str_c("fit_", stan_modelname)
stan_data_lst <- training_stats_tbl %>%
compose_data(
r = 1,
alpha = 4
)
freqmodel_prior_2_subset_stanfit <- freqmodel_prior_stanmodel$sample(
data = stan_data_lst,
chains = 4,
iter_warmup = 500,
iter_sampling = 500,
seed = 4209,
output_dir = stan_modeldir,
output_basename = stanfit_prefix
)## Running MCMC with 4 parallel chains...
##
## Chain 1 Iteration: 1 / 1000 [ 0%] (Warmup)
## Chain 2 Iteration: 1 / 1000 [ 0%] (Warmup)
## Chain 3 Iteration: 1 / 1000 [ 0%] (Warmup)
## Chain 4 Iteration: 1 / 1000 [ 0%] (Warmup)
## Chain 1 Iteration: 100 / 1000 [ 10%] (Warmup)
## Chain 2 Iteration: 100 / 1000 [ 10%] (Warmup)
## Chain 3 Iteration: 100 / 1000 [ 10%] (Warmup)
## Chain 4 Iteration: 100 / 1000 [ 10%] (Warmup)
## Chain 1 Iteration: 200 / 1000 [ 20%] (Warmup)
## Chain 2 Iteration: 200 / 1000 [ 20%] (Warmup)
## Chain 3 Iteration: 200 / 1000 [ 20%] (Warmup)
## Chain 4 Iteration: 200 / 1000 [ 20%] (Warmup)
## Chain 1 Iteration: 300 / 1000 [ 30%] (Warmup)
## Chain 2 Iteration: 300 / 1000 [ 30%] (Warmup)
## Chain 3 Iteration: 300 / 1000 [ 30%] (Warmup)
## Chain 4 Iteration: 300 / 1000 [ 30%] (Warmup)
## Chain 1 Iteration: 400 / 1000 [ 40%] (Warmup)
## Chain 2 Iteration: 400 / 1000 [ 40%] (Warmup)
## Chain 3 Iteration: 400 / 1000 [ 40%] (Warmup)
## Chain 4 Iteration: 400 / 1000 [ 40%] (Warmup)
## Chain 1 Iteration: 500 / 1000 [ 50%] (Warmup)
## Chain 1 Iteration: 501 / 1000 [ 50%] (Sampling)
## Chain 2 Iteration: 500 / 1000 [ 50%] (Warmup)
## Chain 2 Iteration: 501 / 1000 [ 50%] (Sampling)
## Chain 3 Iteration: 500 / 1000 [ 50%] (Warmup)
## Chain 3 Iteration: 501 / 1000 [ 50%] (Sampling)
## Chain 4 Iteration: 500 / 1000 [ 50%] (Warmup)
## Chain 4 Iteration: 501 / 1000 [ 50%] (Sampling)
## Chain 1 Iteration: 600 / 1000 [ 60%] (Sampling)
## Chain 2 Iteration: 600 / 1000 [ 60%] (Sampling)
## Chain 3 Iteration: 600 / 1000 [ 60%] (Sampling)
## Chain 4 Iteration: 600 / 1000 [ 60%] (Sampling)
## Chain 1 Iteration: 700 / 1000 [ 70%] (Sampling)
## Chain 2 Iteration: 700 / 1000 [ 70%] (Sampling)
## Chain 3 Iteration: 700 / 1000 [ 70%] (Sampling)
## Chain 4 Iteration: 700 / 1000 [ 70%] (Sampling)
## Chain 1 Iteration: 800 / 1000 [ 80%] (Sampling)
## Chain 2 Iteration: 800 / 1000 [ 80%] (Sampling)
## Chain 3 Iteration: 800 / 1000 [ 80%] (Sampling)
## Chain 4 Iteration: 800 / 1000 [ 80%] (Sampling)
## Chain 1 Iteration: 900 / 1000 [ 90%] (Sampling)
## Chain 2 Iteration: 900 / 1000 [ 90%] (Sampling)
## Chain 3 Iteration: 900 / 1000 [ 90%] (Sampling)
## Chain 4 Iteration: 900 / 1000 [ 90%] (Sampling)
## Chain 1 Iteration: 1000 / 1000 [100%] (Sampling)
## Chain 1 finished in 97.3 seconds.
## Chain 2 Iteration: 1000 / 1000 [100%] (Sampling)
## Chain 2 finished in 99.3 seconds.
## Chain 3 Iteration: 1000 / 1000 [100%] (Sampling)
## Chain 4 Iteration: 1000 / 1000 [100%] (Sampling)
## Chain 3 finished in 100.4 seconds.
## Chain 4 finished in 100.2 seconds.
##
## All 4 chains finished successfully.
## Mean chain execution time: 99.3 seconds.
## Total execution time: 101.2 seconds.
freqmodel_prior_2_subset_stanfit$summary()## # A tibble: 20,001 × 10
## variable mean median sd mad q5 q95 rhat ess_bulk ess_tail
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 lp__ -52087. -52086. 71.9 70.1 -52205. -51967. 1.00 876. 1063.
## 2 lambda[1] 0.141 0.132 0.0591 0.0579 0.0602 0.250 1.00 3107. 1282.
## 3 lambda[2] 0.141 0.134 0.0553 0.0540 0.0650 0.243 1.00 2485. 1303.
## 4 lambda[3] 1.38 1.37 0.193 0.185 1.08 1.71 1.01 3939. 1272.
## 5 lambda[4] 0.0234 0.0169 0.0227 0.0176 0.00121 0.0689 1.00 2206. 859.
## 6 lambda[5] 0.187 0.178 0.0683 0.0665 0.0893 0.312 0.999 2932. 1218.
## 7 lambda[6] 0.0460 0.0387 0.0316 0.0278 0.00918 0.109 1.00 2524. 1338.
## 8 lambda[7] 0.752 0.742 0.133 0.137 0.549 0.981 1.00 3443. 1420.
## 9 lambda[8] 0.443 0.437 0.104 0.101 0.291 0.629 1.01 2717. 1391.
## 10 lambda[9] 0.0695 0.0627 0.0391 0.0366 0.0199 0.144 1.01 2695. 1494.
## # … with 19,991 more rows
And as before we check the HMC diagnostics.
freqmodel_prior_2_subset_stanfit$cmdstan_diagnose()## Processing csv files: /home/rstudio/workshop/stan_models/fit_freqmodel_prior_2_subset-1.csv, /home/rstudio/workshop/stan_models/fit_freqmodel_prior_2_subset-2.csv, /home/rstudio/workshop/stan_models/fit_freqmodel_prior_2_subset-3.csv, /home/rstudio/workshop/stan_models/fit_freqmodel_prior_2_subset-4.csv
##
## Checking sampler transitions treedepth.
## Treedepth satisfactory for all transitions.
##
## Checking sampler transitions for divergences.
## No divergent transitions found.
##
## Checking E-BFMI - sampler transitions HMC potential energy.
## E-BFMI satisfactory.
##
## Effective sample size satisfactory.
##
## Split R-hat values satisfactory all parameters.
##
## Processing complete, no problems detected.
6.2.3 Hierarchical Model
Finally, we fit the hierarchical model, so that the parameters of the Gamma prior on lambda are also inferred from the data.
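Before fitting, it helps to see the generative structure this model encodes, sketched here by simulating from it directly in R. The hyperprior families and the observation window `t_obs` are assumptions for illustration; the `(1, 3)` and `(2, 2)` hyperparameter values mirror `mean_p1`/`mean_p2` and `cov_p1`/`cov_p2` passed in below, and `hier_mean`/`hier_cov` match the parameter names that appear in the fit summary.

```r
# Minimal sketch of the assumed hierarchical generative structure: the Gamma
# prior on lambda is parameterised by its mean and coefficient of variation,
# which are themselves given (assumed) Gamma hyperpriors.
set.seed(42)

n_customers <- 10
t_obs       <- 52                              # assumed observation window

hier_mean <- rgamma(1, shape = 1, rate = 3)    # mean of the lambda prior
hier_cov  <- rgamma(1, shape = 2, rate = 2)    # coefficient of variation

# Convert (mean, cov) into the standard Gamma(shape, rate) parameterisation:
# for Gamma(k, r), mean = k / r and cov = 1 / sqrt(k).
gamma_shape <- 1 / hier_cov^2
gamma_rate  <- gamma_shape / hier_mean

# Customer-level rates and the implied transaction counts
customer_lambda <- rgamma(n_customers, shape = gamma_shape, rate = gamma_rate)
tnx_count       <- rpois(n_customers, customer_lambda * t_obs)
```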
stan_modelname <- "freqmodel_hier_subset"
stanfit_prefix <- str_c("fit_", stan_modelname)
stan_data_lst <- training_stats_tbl %>%
compose_data(
mean_p1 = 1,
mean_p2 = 3,
cov_p1 = 2,
cov_p2 = 2
)
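`tidybayes::compose_data()` turns a data frame into a Stan-ready list: each column becomes an entry, character/factor columns are recoded as integer indices, row and level counts (`n`, `n_<column>`) are added automatically, and extra named arguments are passed through as scalars. A small illustration (`demo_tbl` is made up for this example):

```r
library(tidybayes)
library(tibble)

# A tiny stand-in data frame (hypothetical, for illustration only)
demo_tbl <- tibble(
  customer_id = c("A", "B", "C"),
  tnx_count   = c(3L, 7L, 2L)
)

demo_lst <- compose_data(demo_tbl, mean_p1 = 1, mean_p2 = 3)

str(demo_lst)
# customer_id is recoded as an integer index, with n and n_customer_id counts
# alongside tnx_count and the scalar hyperparameters.
```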
freqmodel_hier_subset_stanfit <- freqmodel_hier_stanmodel$sample(
data = stan_data_lst,
chains = 4,
iter_warmup = 500,
iter_sampling = 500,
seed = 4210,
output_dir = stan_modeldir,
output_basename = stanfit_prefix
)

## Running MCMC with 4 parallel chains...
##
## Chain 1 Iteration: 1 / 1000 [ 0%] (Warmup)
## Chain 2 Iteration: 1 / 1000 [ 0%] (Warmup)
## Chain 3 Iteration: 1 / 1000 [ 0%] (Warmup)
## Chain 4 Iteration: 1 / 1000 [ 0%] (Warmup)
## Chain 1 Iteration: 100 / 1000 [ 10%] (Warmup)
## Chain 4 Iteration: 100 / 1000 [ 10%] (Warmup)
## Chain 2 Iteration: 100 / 1000 [ 10%] (Warmup)
## Chain 3 Iteration: 100 / 1000 [ 10%] (Warmup)
## Chain 1 Iteration: 200 / 1000 [ 20%] (Warmup)
## Chain 2 Iteration: 200 / 1000 [ 20%] (Warmup)
## Chain 4 Iteration: 200 / 1000 [ 20%] (Warmup)
## Chain 1 Iteration: 300 / 1000 [ 30%] (Warmup)
## Chain 3 Iteration: 200 / 1000 [ 20%] (Warmup)
## Chain 2 Iteration: 300 / 1000 [ 30%] (Warmup)
## Chain 4 Iteration: 300 / 1000 [ 30%] (Warmup)
## Chain 1 Iteration: 400 / 1000 [ 40%] (Warmup)
## Chain 3 Iteration: 300 / 1000 [ 30%] (Warmup)
## Chain 4 Iteration: 400 / 1000 [ 40%] (Warmup)
## Chain 2 Iteration: 400 / 1000 [ 40%] (Warmup)
## Chain 1 Iteration: 500 / 1000 [ 50%] (Warmup)
## Chain 1 Iteration: 501 / 1000 [ 50%] (Sampling)
## Chain 3 Iteration: 400 / 1000 [ 40%] (Warmup)
## Chain 4 Iteration: 500 / 1000 [ 50%] (Warmup)
## Chain 4 Iteration: 501 / 1000 [ 50%] (Sampling)
## Chain 2 Iteration: 500 / 1000 [ 50%] (Warmup)
## Chain 2 Iteration: 501 / 1000 [ 50%] (Sampling)
## Chain 1 Iteration: 600 / 1000 [ 60%] (Sampling)
## Chain 3 Iteration: 500 / 1000 [ 50%] (Warmup)
## Chain 3 Iteration: 501 / 1000 [ 50%] (Sampling)
## Chain 4 Iteration: 600 / 1000 [ 60%] (Sampling)
## Chain 2 Iteration: 600 / 1000 [ 60%] (Sampling)
## Chain 1 Iteration: 700 / 1000 [ 70%] (Sampling)
## Chain 3 Iteration: 600 / 1000 [ 60%] (Sampling)
## Chain 4 Iteration: 700 / 1000 [ 70%] (Sampling)
## Chain 2 Iteration: 700 / 1000 [ 70%] (Sampling)
## Chain 1 Iteration: 800 / 1000 [ 80%] (Sampling)
## Chain 3 Iteration: 700 / 1000 [ 70%] (Sampling)
## Chain 4 Iteration: 800 / 1000 [ 80%] (Sampling)
## Chain 2 Iteration: 800 / 1000 [ 80%] (Sampling)
## Chain 1 Iteration: 900 / 1000 [ 90%] (Sampling)
## Chain 3 Iteration: 800 / 1000 [ 80%] (Sampling)
## Chain 4 Iteration: 900 / 1000 [ 90%] (Sampling)
## Chain 2 Iteration: 900 / 1000 [ 90%] (Sampling)
## Chain 1 Iteration: 1000 / 1000 [100%] (Sampling)
## Chain 1 finished in 183.7 seconds.
## Chain 3 Iteration: 900 / 1000 [ 90%] (Sampling)
## Chain 4 Iteration: 1000 / 1000 [100%] (Sampling)
## Chain 4 finished in 188.1 seconds.
## Chain 2 Iteration: 1000 / 1000 [100%] (Sampling)
## Chain 2 finished in 189.8 seconds.
## Chain 3 Iteration: 1000 / 1000 [100%] (Sampling)
## Chain 3 finished in 191.2 seconds.
##
## All 4 chains finished successfully.
## Mean chain execution time: 188.2 seconds.
## Total execution time: 192.0 seconds.
freqmodel_hier_subset_stanfit$summary()

## # A tibble: 20,005 × 10
## variable mean median sd mad q5 q95 rhat ess_bulk ess_tail
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 lp__ -38216. -38215. 76.5 79.2 -38342. -38092. 1.00 566. 950.
## 2 hier_mean 0.252 0.252 0.00269 0.00271 0.248 0.257 1.00 2950. 1453.
## 3 hier_cov 0.998 0.998 0.00786 0.00751 0.985 1.01 1.01 1192. 1407.
## 4 lambda[1] 0.139 0.130 0.0567 0.0533 0.0608 0.244 1.00 2719. 1186.
## 5 lambda[2] 0.139 0.130 0.0586 0.0561 0.0596 0.249 1.00 2611. 1167.
## 6 lambda[3] 1.38 1.37 0.179 0.177 1.09 1.69 1.00 2814. 1178.
## 7 lambda[4] 0.0234 0.0159 0.0237 0.0168 0.000964 0.0701 1.00 1751. 874.
## 8 lambda[5] 0.190 0.182 0.0670 0.0643 0.0922 0.312 1.01 2727. 1122.
## 9 lambda[6] 0.0481 0.0411 0.0326 0.0277 0.00877 0.110 1.01 2597. 794.
## 10 lambda[7] 0.746 0.738 0.126 0.117 0.544 0.964 1.00 3474. 1400.
## # … with 19,995 more rows
As before, we check the HMC diagnostics.
freqmodel_hier_subset_stanfit$cmdstan_diagnose()

## Processing csv files: /home/rstudio/workshop/stan_models/fit_freqmodel_hier_subset-1.csv, /home/rstudio/workshop/stan_models/fit_freqmodel_hier_subset-2.csv, /home/rstudio/workshop/stan_models/fit_freqmodel_hier_subset-3.csv, /home/rstudio/workshop/stan_models/fit_freqmodel_hier_subset-4.csv
##
## Checking sampler transitions treedepth.
## Treedepth satisfactory for all transitions.
##
## Checking sampler transitions for divergences.
## No divergent transitions found.
##
## Checking E-BFMI - sampler transitions HMC potential energy.
## E-BFMI satisfactory.
##
## Effective sample size satisfactory.
##
## Split R-hat values satisfactory all parameters.
##
## Processing complete, no problems detected.
6.2.4 Construct Model Comparison Plots
We now use each of these fitted models to construct a comparison dataset,
checking both the known lambda value and the observed transaction count
against their posterior distributions.
fit_comparison_tbl <- list(
`Flat Prior` = freqmodel_flat_subset_stanfit,
`Gamma(0.25, 1)` = freqmodel_prior_subset_stanfit,
`Gamma(1, 4)` = freqmodel_prior_2_subset_stanfit,
`Hierarchical` = freqmodel_hier_subset_stanfit
) %>%
enframe(name = "model_label", value = "stanfit") %>%
mutate(
valid_data = map(
stanfit, validate_frequency_model,
input_data_tbl = input_validation_tbl
)
)
fit_comparison_tbl %>% glimpse()

## Rows: 4
## Columns: 3
## $ model_label <chr> "Flat Prior", "Gamma(0.25, 1)", "Gamma(1, 4)", "Hierarchic…
## $ stanfit <list> <<CmdStanMCMC>
## Inherits from: <CmdStanFit>
## Public:
## c…
## $ valid_data <list> [<tbl_df[20000000 x 6]>], [<tbl_df[20000000 x 6]>], [<tbl…
We now use these datasets and our other validation routines to check the
q-vals for both lambda and the transaction count.
validation_data_tbl <- fit_comparison_tbl %>%
select(model_label, valid_data) %>%
unnest(valid_data)
validation_data_tbl %>% glimpse()

## Rows: 80,000,000
## Columns: 7
## $ model_label <chr> "Flat Prior", "Flat Prior", "Flat Prior", "Flat Prior"…
## $ customer_id <chr> "0010G0VY0J", "0010G0VY0J", "0010G0VY0J", "0010G0VY0J"…
## $ draw_id <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16,…
## $ customer_lambda <dbl> 0.2207705, 0.2207705, 0.2207705, 0.2207705, 0.2207705,…
## $ post_lambda <dbl> 0.2839830, 0.1164000, 0.1570210, 0.1270920, 0.1842610,…
## $ pred_count <int> 5, 0, 1, 6, 2, 3, 3, 1, 3, 0, 4, 7, 2, 2, 1, 1, 0, 1, …
## $ obs_count <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
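`calculate_distribution_qvals()` is defined elsewhere in the workshop code; a plausible sketch of what it computes, assuming the q-val is the quantile of the reference value within the posterior draws, is:

```r
library(dplyr)

# Hypothetical sketch of calculate_distribution_qvals: within each group, the
# q-val is the proportion of posterior draws at or below the reference value.
# For a well-calibrated posterior these q-vals are roughly Uniform(0, 1).
calculate_distribution_qvals_sketch <- function(data_tbl, draws_col, ref_col, ...) {
  data_tbl %>%
    group_by(...) %>%
    summarise(
      q_val = mean({{ draws_col }} <= {{ ref_col }}),
      .groups = "drop"
    )
}

# e.g. validation_data_tbl %>%
#   calculate_distribution_qvals_sketch(post_lambda, customer_lambda,
#                                       model_label, customer_id)
```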
First we plot the comparison histograms of the q-vals for lambda. For a well-calibrated model these q-vals should be approximately uniform, and the red horizontal line marks the count expected per bin under uniformity:
valid_lambda_tbl <- validation_data_tbl %>%
calculate_distribution_qvals(post_lambda, customer_lambda, model_label, customer_id)
ggplot(valid_lambda_tbl) +
geom_histogram(aes(x = q_val), bins = 50) +
geom_hline(aes(yintercept = 200), colour = "red") +
facet_wrap(vars(model_label), ncol = 1) +
labs(
x = "Quantile",
y = "Count",
title = "Comparison Histogram of q-vals for Lambda"
)

We then plot the same comparison histograms for the transaction counts:

valid_count_tbl <- validation_data_tbl %>%
calculate_distribution_qvals(pred_count, obs_count, model_label, customer_id)
ggplot(valid_count_tbl) +
geom_histogram(aes(x = q_val), bins = 50) +
geom_hline(aes(yintercept = 200), colour = "red") +
facet_wrap(vars(model_label), ncol = 1) +
labs(
x = "Quantile",
y = "Count",
title = "Comparison Histogram of q-vals for Count"
)

7 Write Data to Disk
We now want to write a number of these values to disk.
customer_summarystats_tbl %>% write_rds("data/synthdata_singleyear_summarystats_tbl.rds")
validation_data_tbl %>% write_rds("data/synthdata_singleyear_subset_comparison_distrib_tbl.rds")

8 R Environment
options(width = 120L)
sessioninfo::session_info()

## ─ Session info ───────────────────────────────────────────────────────────────
## setting value
## version R version 4.1.3 (2022-03-10)
## os Ubuntu 20.04.4 LTS
## system x86_64, linux-gnu
## ui X11
## language (EN)
## collate en_US.UTF-8
## ctype en_US.UTF-8
## tz Etc/UTC
## date 2022-05-11
## pandoc 2.17.1.1 @ /usr/local/bin/ (via rmarkdown)
##
## ─ Packages ───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
## package * version date (UTC) lib source
## abind 1.4-5 2016-07-21 [1] RSPM (R 4.1.0)
## arrayhelpers 1.1-0 2020-02-04 [1] RSPM (R 4.1.0)
## assertthat 0.2.1 2019-03-21 [1] RSPM (R 4.1.0)
## backports 1.4.1 2021-12-13 [1] RSPM (R 4.1.0)
## base64enc 0.1-3 2015-07-28 [1] RSPM (R 4.1.0)
## bayesplot * 1.9.0 2022-03-10 [1] RSPM (R 4.1.0)
## bit 4.0.4 2020-08-04 [1] RSPM (R 4.1.0)
## bit64 4.0.5 2020-08-30 [1] RSPM (R 4.1.0)
## bookdown 0.26 2022-04-15 [1] RSPM (R 4.1.0)
## boot 1.3-28 2021-05-03 [2] CRAN (R 4.1.3)
## bridgesampling 1.1-2 2021-04-16 [1] RSPM (R 4.1.0)
## brms * 2.17.0 2022-05-11 [1] Github (paul-buerkner/brms@a43937c)
## Brobdingnag 1.2-7 2022-02-03 [1] RSPM (R 4.1.0)
## broom 0.8.0 2022-04-13 [1] RSPM (R 4.1.0)
## bslib 0.3.1 2021-10-06 [1] RSPM (R 4.1.0)
## cachem 1.0.6 2021-08-19 [1] RSPM (R 4.1.0)
## callr 3.7.0 2021-04-20 [1] RSPM (R 4.1.0)
## cellranger 1.1.0 2016-07-27 [1] RSPM (R 4.1.0)
## checkmate 2.0.0 2020-02-06 [1] RSPM (R 4.1.0)
## cli 3.2.0 2022-02-14 [1] RSPM (R 4.1.0)
## cmdstanr * 0.5.2 2022-05-11 [1] Github (stan-dev/cmdstanr@9bbc4eb)
## coda 0.19-4 2020-09-30 [1] RSPM (R 4.1.0)
## codetools 0.2-18 2020-11-04 [2] CRAN (R 4.1.3)
## colorspace 2.0-3 2022-02-21 [1] RSPM (R 4.1.0)
## colourpicker 1.1.1 2021-10-04 [1] RSPM (R 4.1.0)
## conflicted * 1.1.0 2021-11-26 [1] RSPM (R 4.1.0)
## cowplot * 1.1.1 2020-12-30 [1] RSPM (R 4.1.0)
## crayon 1.5.1 2022-03-26 [1] RSPM (R 4.1.0)
## crosstalk 1.2.0 2021-11-04 [1] RSPM (R 4.1.0)
## curl 4.3.2 2021-06-23 [1] RSPM (R 4.1.0)
## DBI 1.1.2 2021-12-20 [1] RSPM (R 4.1.0)
## dbplyr 2.1.1 2021-04-06 [1] RSPM (R 4.1.0)
## digest 0.6.29 2021-12-01 [1] RSPM (R 4.1.0)
## directlabels * 2021.1.13 2021-01-16 [1] RSPM (R 4.1.0)
## distributional 0.3.0 2022-01-05 [1] RSPM (R 4.1.0)
## dplyr * 1.0.8 2022-02-08 [1] RSPM (R 4.1.0)
## DT 0.22 2022-03-28 [1] RSPM (R 4.1.0)
## dygraphs 1.1.1.6 2018-07-11 [1] RSPM (R 4.1.0)
## ellipsis 0.3.2 2021-04-29 [1] RSPM (R 4.1.0)
## evaluate 0.15 2022-02-18 [1] RSPM (R 4.1.0)
## fansi 1.0.3 2022-03-24 [1] RSPM (R 4.1.0)
## farver 2.1.0 2021-02-28 [1] RSPM (R 4.1.0)
## fastmap 1.1.0 2021-01-25 [1] RSPM (R 4.1.0)
## forcats * 0.5.1 2021-01-27 [1] RSPM (R 4.1.0)
## fs * 1.5.2 2021-12-08 [1] RSPM (R 4.1.0)
## furrr * 0.2.3 2021-06-25 [1] RSPM (R 4.1.0)
## future * 1.24.0 2022-02-19 [1] RSPM (R 4.1.0)
## gamm4 0.2-6 2020-04-03 [1] RSPM (R 4.1.0)
## generics 0.1.2 2022-01-31 [1] RSPM (R 4.1.0)
## ggdist 3.1.1 2022-02-27 [1] RSPM (R 4.1.0)
## ggplot2 * 3.3.5 2021-06-25 [1] RSPM (R 4.1.0)
## ggridges 0.5.3 2021-01-08 [1] RSPM (R 4.1.0)
## globals 0.14.0 2020-11-22 [1] RSPM (R 4.1.0)
## glue * 1.6.2 2022-02-24 [1] RSPM (R 4.1.0)
## gridExtra 2.3 2017-09-09 [1] RSPM (R 4.1.0)
## gtable 0.3.0 2019-03-25 [1] RSPM (R 4.1.0)
## gtools 3.9.2 2021-06-06 [1] RSPM (R 4.1.0)
## haven 2.5.0 2022-04-15 [1] RSPM (R 4.1.0)
## highr 0.9 2021-04-16 [1] RSPM (R 4.1.0)
## hms 1.1.1 2021-09-26 [1] RSPM (R 4.1.0)
## htmltools 0.5.2 2021-08-25 [1] RSPM (R 4.1.0)
## htmlwidgets 1.5.4 2021-09-08 [1] RSPM (R 4.1.0)
## httpuv 1.6.5 2022-01-05 [1] RSPM (R 4.1.0)
## httr 1.4.2 2020-07-20 [1] RSPM (R 4.1.0)
## igraph 1.3.1 2022-04-20 [1] RSPM (R 4.1.0)
## inline 0.3.19 2021-05-31 [1] RSPM (R 4.1.0)
## jquerylib 0.1.4 2021-04-26 [1] RSPM (R 4.1.0)
## jsonlite 1.8.0 2022-02-22 [1] RSPM (R 4.1.0)
## knitr 1.38 2022-03-25 [1] RSPM (R 4.1.0)
## labeling 0.4.2 2020-10-20 [1] RSPM (R 4.1.0)
## later 1.3.0 2021-08-18 [1] RSPM (R 4.1.0)
## lattice 0.20-45 2021-09-22 [2] CRAN (R 4.1.3)
## lifecycle 1.0.1 2021-09-24 [1] RSPM (R 4.1.0)
## listenv 0.8.0 2019-12-05 [1] RSPM (R 4.1.0)
## lme4 1.1-29 2022-04-07 [1] RSPM (R 4.1.0)
## loo 2.5.1 2022-03-24 [1] RSPM (R 4.1.0)
## lubridate 1.8.0 2021-10-07 [1] RSPM (R 4.1.0)
## magrittr * 2.0.3 2022-03-30 [1] RSPM (R 4.1.0)
## markdown 1.1 2019-08-07 [1] RSPM (R 4.1.0)
## MASS 7.3-55 2022-01-16 [2] CRAN (R 4.1.3)
## Matrix 1.4-0 2021-12-08 [2] CRAN (R 4.1.3)
## matrixStats 0.62.0 2022-04-19 [1] RSPM (R 4.1.0)
## memoise 2.0.1 2021-11-26 [1] RSPM (R 4.1.0)
## mgcv 1.8-39 2022-02-24 [2] CRAN (R 4.1.3)
## mime 0.12 2021-09-28 [1] RSPM (R 4.1.0)
## miniUI 0.1.1.1 2018-05-18 [1] RSPM (R 4.1.0)
## minqa 1.2.4 2014-10-09 [1] RSPM (R 4.1.0)
## modelr 0.1.8 2020-05-19 [1] RSPM (R 4.1.0)
## munsell 0.5.0 2018-06-12 [1] RSPM (R 4.1.0)
## mvtnorm 1.1-3 2021-10-08 [1] RSPM (R 4.1.0)
## nlme 3.1-155 2022-01-16 [2] CRAN (R 4.1.3)
## nloptr 2.0.0 2022-01-26 [1] RSPM (R 4.1.0)
## parallelly 1.31.0 2022-04-07 [1] RSPM (R 4.1.0)
## pillar 1.7.0 2022-02-01 [1] RSPM (R 4.1.0)
## pkgbuild 1.3.1 2021-12-20 [1] RSPM (R 4.1.0)
## pkgconfig 2.0.3 2019-09-22 [1] RSPM (R 4.1.0)
## plyr 1.8.7 2022-03-24 [1] RSPM (R 4.1.0)
## posterior * 1.2.1 2022-03-07 [1] RSPM (R 4.1.0)
## prettyunits 1.1.1 2020-01-24 [1] RSPM (R 4.1.0)
## processx 3.5.3 2022-03-25 [1] RSPM (R 4.1.0)
## projpred 2.1.1 2022-04-03 [1] RSPM (R 4.1.0)
## promises 1.2.0.1 2021-02-11 [1] RSPM (R 4.1.0)
## ps 1.6.0 2021-02-28 [1] RSPM (R 4.1.0)
## purrr * 0.3.4 2020-04-17 [1] RSPM (R 4.1.0)
## quadprog 1.5-8 2019-11-20 [1] RSPM (R 4.1.0)
## R6 2.5.1 2021-08-19 [1] RSPM (R 4.1.0)
## Rcpp * 1.0.8.3 2022-03-17 [1] RSPM (R 4.1.0)
## RcppParallel 5.1.5 2022-01-05 [1] RSPM (R 4.1.0)
## readr * 2.1.2 2022-01-30 [1] RSPM (R 4.1.0)
## readxl 1.4.0 2022-03-28 [1] RSPM (R 4.1.0)
## reprex 2.0.1 2021-08-05 [1] RSPM (R 4.1.0)
## reshape2 1.4.4 2020-04-09 [1] RSPM (R 4.1.0)
## rlang * 1.0.2 2022-03-04 [1] RSPM (R 4.1.0)
## rmarkdown 2.13 2022-03-10 [1] RSPM (R 4.1.0)
## rmdformats 1.0.3 2021-10-06 [1] RSPM (R 4.1.0)
## rstan 2.26.11 2022-05-11 [1] local
## rstantools 2.2.0 2022-04-08 [1] RSPM (R 4.1.0)
## rstudioapi 0.13 2020-11-12 [1] RSPM (R 4.1.0)
## rvest 1.0.2 2021-10-16 [1] RSPM (R 4.1.0)
## sass 0.4.1 2022-03-23 [1] RSPM (R 4.1.0)
## scales * 1.2.0 2022-04-13 [1] RSPM (R 4.1.0)
## sessioninfo 1.2.2 2021-12-06 [1] RSPM (R 4.1.0)
## shiny 1.7.1 2021-10-02 [1] RSPM (R 4.1.0)
## shinyjs 2.1.0 2021-12-23 [1] RSPM (R 4.1.0)
## shinystan 2.6.0 2022-03-03 [1] RSPM (R 4.1.0)
## shinythemes 1.2.0 2021-01-25 [1] RSPM (R 4.1.0)
## StanHeaders 2.26.11 2022-05-11 [1] local
## stringi 1.7.6 2021-11-29 [1] RSPM (R 4.1.0)
## stringr * 1.4.0 2019-02-10 [1] RSPM (R 4.1.0)
## svUnit 1.0.6 2021-04-19 [1] RSPM (R 4.1.0)
## tensorA 0.36.2 2020-11-19 [1] RSPM (R 4.1.0)
## threejs 0.3.3 2020-01-21 [1] RSPM (R 4.1.0)
## tibble * 3.1.6 2021-11-07 [1] RSPM (R 4.1.0)
## tidybayes * 3.0.2 2022-01-05 [1] RSPM (R 4.1.0)
## tidyr * 1.2.0 2022-02-01 [1] RSPM (R 4.1.0)
## tidyselect 1.1.2 2022-02-21 [1] RSPM (R 4.1.0)
## tidyverse * 1.3.1 2021-04-15 [1] RSPM (R 4.1.0)
## tzdb 0.3.0 2022-03-28 [1] RSPM (R 4.1.0)
## utf8 1.2.2 2021-07-24 [1] RSPM (R 4.1.0)
## V8 4.1.0 2022-02-06 [1] RSPM (R 4.1.0)
## vctrs 0.4.1 2022-04-13 [1] RSPM (R 4.1.0)
## vroom 1.5.7 2021-11-30 [1] RSPM (R 4.1.0)
## withr 2.5.0 2022-03-03 [1] RSPM (R 4.1.0)
## xfun 0.30 2022-03-02 [1] RSPM (R 4.1.0)
## xml2 1.3.3 2021-11-30 [1] RSPM (R 4.1.0)
## xtable 1.8-4 2019-04-21 [1] RSPM (R 4.1.0)
## xts 0.12.1 2020-09-09 [1] RSPM (R 4.1.0)
## yaml 2.3.5 2022-02-21 [1] RSPM (R 4.1.0)
## zoo 1.8-10 2022-04-15 [1] RSPM (R 4.1.0)
##
## [1] /usr/local/lib/R/site-library
## [2] /usr/local/lib/R/library
##
## ──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
options(width = 80L)